From janne.peltonen at helsinki.fi  Sun Jul  1 11:17:48 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 14:17:48 +0300
Subject: [Linux-cluster] Rgmanager fails to restart
Message-ID: <20070701111748.GA9103@helsinki.fi>

Hi!

Sometimes, when I have cleanly shut down rgmanager on one node, and the
services have nicely migrated to other nodes, trying to start rgmanager
fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
device". clurgmgrd concludes that locks are not working and exits.
(See strace output attached.)

--cut--
[jmmpelto at pcn1 ~]$ sudo service rgmanager start
Starting Cluster Service Manager:                          [  OK  ]
[jmmpelto at pcn1 ~]$ sudo service rgmanager status
clurgmgrd dead but pid file exists
--cut--

Trying to stop cman fails:

--clip--
[jmmpelto at pcn1 ~]$ sudo service cman restart
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
--clip--

And indeed, the rgmanager that isn't there is there:

--clip--
[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name       id       state       
fence            0     default    00010001 none        
[1 2 3 4 100]
dlm              1     clvmd      00010002 none        
[1 2 3 4 100]
dlm              1     rgmanager  00020002 none        
[1 2 3 4]
--clip--

If I say 'cman_tool leave force', it succeeds. But if I then try starting the cluster:

--cut--
[jmmpelto at pcn1 ~]$ sudo service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... failed

                                                           [FAILED]
--cut--

Log (oops, I forgot to shut down clvmd there... it would have gone down cleanly):

--cut--
Jul  1 14:11:02 pcn1.mappi.helsinki.fi ccsd[4427]: Initial status:: Inquorate
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: found uncontrolled kernel object rgmanager in /sys/kernel/dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: found uncontrolled kernel object clvmd in /sys/kernel/dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi fenced[568]: cman_init error 0 111
Jul  1 14:11:28 pcn1.mappi.helsinki.fi dlm_controld[576]: cman_init error 0 111
Jul  1 14:11:28 pcn1.mappi.helsinki.fi gfs_controld[583]: cman_init error 111
--cut--

Thereafter, one of the other nodes fences this one:

--cut--
Jul  1 14:11:50 pcn1.mappi.helsinki.fi init: Switching to runlevel: 0
Jul  1 14:11:50 pcn1.mappi.helsinki.fi ccsd[4427]: Unable to connect to cluster infrastructure after 30 seconds.
Jul  1 14:11:52 pcn1.mappi.helsinki.fi rgmanager: [667]: <notice> Cluster Service Manager is stopped.
--cut--

(Now I wonder where that rgmanager log line came from? It isn't from any
clurgmgrd, I checked with ps that there were none running.)

Any ideas?

(version of relevant packages:

lvm2-2.02.16-3.el5
cman-2.0.60-1.el5
rgmanager-2.0.23-1.el5.centos

)


--Janne
-------------- next part --------------
execve("/usr/sbin/clurgmgrd", ["clurgmgrd"], [/* 17 vars */]) = 0
brk(0)                                  = 0xc4f2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaab000
uname({sys="Linux", node="pcn1.mappi.helsinki.fi", ...}) = 0
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=45165, ...}) = 0
mmap(NULL, 45165, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2aaaaaaac000
close(3)                                = 0
open("/usr/lib64/libxml2.so.2", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \262B_<\0\0\0@\0\0\0\0\0\0\0\260\303\23\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0 at _<\0\0\0\0\0 at _<\0\0\0\24\"\23\0\0\0\0\0\24\"\23\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0\0000\23\0\0\0\0\0\0000s_<\0\0\0\000"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1297136, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaab8000
mmap(0x3c5f400000, 3395256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5f400000
mprotect(0x3c5f533000, 2097152, PROT_NONE) = 0
mmap(0x3c5f733000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x3c5f733000
mmap(0x3c5f73c000, 3768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c5f73c000
close(3)                                = 0
open("/lib64/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20W\300Y<\0\0\0@\0\0\0\0\0\0\0\330\35\2\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0\'\0&\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\300Y<\0\0\0@\0\300Y<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0@\375\0\0\0\0\0\0@\375\300Y<\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=141208, ...}) = 0
mmap(0x3c59c00000, 2200432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59c00000
mprotect(0x3c59c15000, 2093056, PROT_NONE) = 0
mmap(0x3c59e14000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x3c59e14000
mmap(0x3c59e16000, 13168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c59e16000
close(3)                                = 0
open("/lib64/libdl.so.2", O_RDONLY)     = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \16\200Y<\0\0\0@\0\0\0\0\0\0\0\240R\0\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0%\0$\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\200Y<\0\0\0@\0\200Y<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\240\32\0\0\0\0\0\0\240\32\200Y<\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=23520, ...}) = 0
mmap(0x3c59800000, 2109728, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59800000
mprotect(0x3c59802000, 2097152, PROT_NONE) = 0
mmap(0x3c59a02000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3c59a02000
close(3)                                = 0
open("/usr/lib64/libcman.so.2", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\20\200Z<\0\0\0@\0\0\0\0\0\0\0\240L\0\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\200Z<\0\0\0\0\0\200Z<\0\0\0\34A\0\0\0\0\0\0\34A\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0 A\0\0\0\0\0\0 A\240Z<\0\0\0 A\240Z<"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=21472, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaab9000
mmap(0x3c5a800000, 2114456, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a800000
mprotect(0x3c5a805000, 2093056, PROT_NONE) = 0
mmap(0x3c5aa04000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x3c5aa04000
close(3)                                = 0
open("/usr/lib64/libdlm.so.2", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\27\300Z<\0\0\0@\0\0\0\0\0\0\0\20H\0\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\300Z<\0\0\0\0\0\300Z<\0\0\0L;\0\0\0\0\0\0L;\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0P;\0\0\0\0\0\0P;\340Z<\0\0\0P;\340Z<\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=20304, ...}) = 0
mmap(0x3c5ac00000, 2113272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5ac00000
mprotect(0x3c5ac04000, 2093056, PROT_NONE) = 0
mmap(0x3c5ae03000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x3c5ae03000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\331AY<\0\0\0@\0\0\0\0\0\0\0P\211\31\0\0\0\0\0\0\0\0\0@\0008\0\n\0@\0M\0L\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0 at Y<\0\0\0@\0 at Y<\0\0\0000\2\0\0\0\0\0\0000\2\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\240\257\21\0\0\0\0\0\240\257QY<\0\0\0\240\257"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1678480, ...}) = 0
mmap(0x3c59400000, 3461272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59400000
mprotect(0x3c59544000, 2097152, PROT_NONE) = 0
mmap(0x3c59744000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x144000) = 0x3c59744000
mmap(0x3c59749000, 16536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c59749000
close(3)                                = 0
open("/usr/lib64/libz.so.1", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\36 at Z<\0\0\0@\0\0\0\0\0\0\0(G\1\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0 at Z<\0\0\0\0\0 at Z<\0\0\0\3648\1\0\0\0\0\0\3648\1\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0\3708\1\0\0\0\0\0\3708aZ<\0\0\0\3708aZ<\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=85608, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaba000
mmap(0x3c5a400000, 2178600, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a400000
mprotect(0x3c5a414000, 2093056, PROT_NONE) = 0
mmap(0x3c5a613000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13000) = 0x3c5a613000
close(3)                                = 0
open("/lib64/libm.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200>\0Z<\0\0\0@\0\0\0\0\0\0\0\240X\t\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0)\0(\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\0Z<\0\0\0@\0\0Z<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\260\304\7\0\0\0\0\0\260\304\7Z<\0\0\0\260"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=615136, ...}) = 0
mmap(0x3c5a000000, 2629848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a000000
mprotect(0x3c5a082000, 2093056, PROT_NONE) = 0
mmap(0x3c5a281000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x81000) = 0x3c5a281000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaabb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaabc000
arch_prctl(ARCH_SET_FS, 0x2aaaaaabba00) = 0
mprotect(0x3c59e14000, 4096, PROT_READ) = 0
mprotect(0x3c59a02000, 4096, PROT_READ) = 0
mprotect(0x3c59744000, 16384, PROT_READ) = 0
mprotect(0x3c5a281000, 4096, PROT_READ) = 0
mprotect(0x3c59219000, 4096, PROT_READ) = 0
munmap(0x2aaaaaaac000, 45165)           = 0
set_tid_address(0x2aaaaaabba90)         = 393
set_robust_list(0x2aaaaaabbaa0, 0x18)   = 0
rt_sigaction(SIGRTMIN, {0x3c59c05350, [], SA_RESTORER|SA_SIGINFO, 0x3c59c0dd40}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x3c59c052a0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3c59c0dd40}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
geteuid()                               = 0
getuid()                                = 0
stat("/var/run/clurgmgrd.pid", {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
brk(0)                                  = 0xc4f2000
brk(0xc513000)                          = 0xc513000
open("/var/run/clurgmgrd.pid", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
read(3, "369", 4096)                    = 3
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x2aaaaaaac000, 4096)            = 0
open("/proc/369", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV RTMIN RT_1], NULL, 8) = 0
clone(Process 394 attached (waiting for parent)
Process 394 resumed (parent 393 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaabba90) = 394
[pid   393] exit_group(0)               = ?
Process 393 detached
setsid()                                = 394
chdir("/")                              = 0
open("/dev/null", O_RDWR)               = 3
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
dup2(3, 0)                              = 0
dup2(3, 1)                              = 1
dup2(3, 2)                              = 2
close(3)                                = 0
open("/var/run/clurgmgrd.pid", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
write(3, "394", 3)                      = 3
close(3)                                = 0
munmap(0x2aaaaaaac000, 4096)            = 0
getpriority(PRIO_PROCESS, 0)            = 20
setpriority(PRIO_PROCESS, 0, 4294967295) = 0
getpriority(PRIO_PROCESS, 0)            = 21
clone(Process 395 attached (waiting for parent)
Process 395 resumed (parent 394 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaabba90) = 395
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
[pid   394] rt_sigaction(SIG_0, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = -1 EINVAL (Invalid argument)
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0
[pid   394] rt_sigaction(SIGHUP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0
[pid   394] rt_sigaction(SIGINT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [QUIT], NULL, 8) = 0
[pid   394] rt_sigaction(SIGQUIT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [ILL], NULL, 8) = 0
[pid   394] rt_sigaction(SIGILL, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [TRAP], NULL, 8) = 0
[pid   394] rt_sigaction(SIGTRAP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid   394] rt_sigaction(SIGABRT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [BUS], NULL, 8) = 0
[pid   394] rt_sigaction(SIGBUS, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [FPE], NULL, 8) = 0
[pid   394] rt_sigaction(SIGFPE, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [KILL], NULL, 8) = 0
[pid   394] rt_sigaction(SIGKILL, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = -1 EINVAL (Invalid argument)
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
[pid   394] rt_sigaction(SIGUSR1, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
[pid   394] rt_sigaction(SIGSEGV, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [USR2], NULL, 8) = 0
[pid   394] rt_sigaction(SIGUSR2, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [PIPE], NULL, 8) = 0
[pid   394] rt_sigaction(SIGPIPE, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
[pid   394] rt_sigaction(SIGALRM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
[pid   394] rt_sigaction(SIGTERM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [STKFLT], NULL, 8) = 0
[pid   394] rt_sigaction(SIGSTKFLT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   395] socket(PF_FILE, SOCK_STREAM, 0 <unfinished ...>
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [CHLD],  <unfinished ...>
[pid   395] <... socket resumed> )      = 3
[pid   394] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid   395] fcntl(3, F_SETFD, FD_CLOEXEC) = 0
[pid   394] rt_sigaction(SIGCHLD, {SIG_DFL},  <unfinished ...>
[pid   395] connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110 <unfinished ...>
[pid   394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0
[pid   395] <... connect resumed> )     = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [CONT],  <unfinished ...>
[pid   395] open("/dev/zero", O_RDONLY <unfinished ...>
[pid   394] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid   395] <... open resumed> )        = 4
[pid   394] rt_sigaction(SIGCONT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40},  <unfinished ...>
[pid   395] writev(3, [{"NAMC\3\0\0\20\24\0\0\0\5\0\0\0\0\0\0\0", 20}], 1 <unfinished ...>
[pid   394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0
[pid   395] <... writev resumed> )      = 20
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [STOP],  <unfinished ...>
[pid   395] recvfrom(3,  <unfinished ...>
[pid   394] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid   395] <... recvfrom resumed> "NAMC\0\0\0\0\30\0\0\0\5\0\0@\0\0\0\0", 20, 0, NULL, NULL) = 20
[pid   395] read(3,  <unfinished ...>
[pid   394] rt_sigaction(SIGSTOP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40},  <unfinished ...>
[pid   395] <... read resumed> "\1\0\0\0", 4) = 4
[pid   394] <... rt_sigaction resumed> {SIG_DFL}, 8) = -1 EINVAL (Invalid argument)
[pid   395] pipe( <unfinished ...>
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [TSTP],  <unfinished ...>
[pid   395] <... pipe resumed> [5, 6])  = 0
[pid   395] fcntl(5, F_GETFL <unfinished ...>
[pid   394] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid   395] <... fcntl resumed> )       = 0 (flags O_RDONLY)
[pid   395] fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid   394] rt_sigaction(SIGTSTP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40},  <unfinished ...>
[pid   395] <... fcntl resumed> )       = 0
[pid   394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [TTIN],  <unfinished ...>
[pid   395] open("/dev/misc/dlm_rgmanager", O_RDWR <unfinished ...>
[pid   394] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid   394] rt_sigaction(SIGTTIN, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [TTOU], NULL, 8) = 0
[pid   394] rt_sigaction(SIGTTOU, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [URG], NULL, 8) = 0
[pid   394] rt_sigaction(SIGURG, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [XCPU], NULL, 8) = 0
[pid   394] rt_sigaction(SIGXCPU, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [XFSZ], NULL, 8) = 0
[pid   394] rt_sigaction(SIGXFSZ, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [VTALRM], NULL, 8) = 0
[pid   394] rt_sigaction(SIGVTALRM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0
[pid   394] rt_sigaction(SIGPROF, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [WINCH], NULL, 8) = 0
[pid   394] rt_sigaction(SIGWINCH, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [IO], NULL, 8) = 0
[pid   394] rt_sigaction(SIGIO, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [PWR], NULL, 8) = 0
[pid   394] rt_sigaction(SIGPWR, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [SYS], NULL, 8) = 0
[pid   394] rt_sigaction(SIGSYS, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RTMIN], NULL, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_1], NULL, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_2], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_2, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_3], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_3, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_4], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_4, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_5], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_5, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_6], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_6, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_7], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_7, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_8], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_8, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_9], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_9, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_10], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_10, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_11], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_11, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_12], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_12, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_13], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_13, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_14], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_14, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_15], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_15, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_16], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_16, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_17], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_17, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_18], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_18, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_19], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_19, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_20], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_20, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_21], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_21, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_22], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_22, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_23], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_23, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_24], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_24, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_25], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_25, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_26], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_26, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_27], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_27, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_28], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_28, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_29], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_29, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_30], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_30, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [RT_31], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_31, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
[pid   394] rt_sigaction(SIGRT_32, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0
[pid   394] wait4(395, Process 394 suspended
 <unfinished ...>
[pid   395] <... open resumed> )        = -1 ENODEV (No such device)
[pid   395] stat("/dev/misc/dlm-control", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 62), ...}) = 0
[pid   395] open("/proc/misc", O_RDONLY) = 7
[pid   395] fstat(7, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid   395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
[pid   395] read(7, "209 cpqci\n 60 dlm_clvmd\n 61 lock_dlm_plock\n 62 dlm-control\n 63 device-mapper\n144 nvram\n228 hpet\n135 rtc\n231 snapshot\n227 mcelog\n", 4096) = 128
[pid   395] close(7)                    = 0
[pid   395] munmap(0x2aaaaaaac000, 4096) = 0
[pid   395] open("/dev/misc/dlm-control", O_RDWR) = 7
[pid   395] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid   395] write(7, "\5\0\0\0\0\0\0\0\0\0\0\0\4\1\0\0\0\0\0\0\0\0\0\0rgmanager\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\300\233!Y<\0\0\0\200 O\f\0\0\0\0\330\333A\0\0\0\0\0\320*\0\203\377\177\0\0\0\0\0\0\0\0\0\0\263\270DY<\0\0\0 ", 113) = -1 EEXIST (File exists)
[pid   395] open("/proc/misc", O_RDONLY) = 8
[pid   395] fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid   395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
[pid   395] read(8, "209 cpqci\n 60 dlm_clvmd\n 61 lock_dlm_plock\n 62 dlm-control\n 63 device-mapper\n144 nvram\n228 hpet\n135 rtc\n231 snapshot\n227 mcelog\n", 4096) = 128
[pid   395] read(8, "", 4096)           = 0
[pid   395] close(8)                    = 0
[pid   395] munmap(0x2aaaaaaac000, 4096) = 0
[pid   395] stat("/dev/misc/dlm_rgmanager", {st_mode=S_IFCHR|0644, st_rdev=makedev(10, 0), ...}) = 0
[pid   395] stat("/dev/misc/dlm_rgmanager", {st_mode=S_IFCHR|0644, st_rdev=makedev(10, 0), ...}) = 0
[pid   395] open("/dev/misc/dlm_rgmanager", O_RDWR) = -1 ENODEV (No such device)
[pid   395] write(2, "failed acquiring lockspace: No such device\n", 43) = 43
[pid   395] fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
[pid   395] ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff83002690) = -1 ENOTTY (Inappropriate ioctl for device)
[pid   395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
[pid   395] write(1, "Locks not working!\n", 19) = 19
[pid   395] exit_group(-1)              = ?
Process 394 resumed
Process 395 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], 0, NULL) = 395
--- SIGCHLD (Child exited) @ 0 (0) ---
exit_group(255)                         = ?
Process 394 detached

From janne.peltonen at helsinki.fi  Sun Jul  1 11:30:40 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 14:30:40 +0300
Subject: [Linux-cluster] Rgmanager fails to restart
In-Reply-To: <20070701111748.GA9103@helsinki.fi>
References: <20070701111748.GA9103@helsinki.fi>
Message-ID: <20070701113040.GB9103@helsinki.fi>

On Sun, Jul 01, 2007 at 02:17:48PM +0300, Janne Peltonen wrote:
> Hi!
> 
> Sometimes, when I have cleanly shut down rgmanager on one node, and the
> services have nicely migrated to other nodes, trying to start rgmanager
> fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> device". clurgmgrd concludes that locks are not working and exits.
> (See strace output attached.)

Interesting. After the one node with failing rgmanagers was shot in the
head (there were no log lines about fencing, only two about deferring
fencing to an earlier node), the fenced node was left in 'off' state, and,
well, the other nodes had their services left running (but rgmanagers
apparently stuck - no more status checks an no response to the clustat
command). The node that (apparently, since there is no log entry) did
the fencing:

[jmmpelto at pcn2 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]

Other nodes with rgmanager running:

[jmmpelto at pcn3 ~]$ sudo cman_tool services
type             level name       id       state       
fence            0     default    00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]

The fifth node without rgmanager:

[jmmpelto at pcnm ~]$ sudo cman_tool services
type             level name     id       state       
fence            0     default  00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd    00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]

Er. What might be up.


--Janne



From janne.peltonen at helsinki.fi  Sun Jul  1 11:45:21 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 14:45:21 +0300
Subject: [Linux-cluster] Rgmanager fails to restart
In-Reply-To: <20070701113040.GB9103@helsinki.fi>
References: <20070701111748.GA9103@helsinki.fi>
	<20070701113040.GB9103@helsinki.fi>
Message-ID: <20070701114521.GC9103@helsinki.fi>

The story continues...

On Sun, Jul 01, 2007 at 02:30:40PM +0300, Janne Peltonen wrote:
> > Sometimes, when I have cleanly shut down rgmanager on one node, and the
> > services have nicely migrated to other nodes, trying to start rgmanager
> > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> > device". clurgmgrd concludes that locks are not working and exits.
> > (See strace output attached.)
> Interesting. After the one node with failing rgmanagers was shot in the
> head (there were no log lines about fencing, only two about deferring
> fencing to an earlier node), the fenced node was left in 'off' state, and,
> well, the other nodes had their services left running (but rgmanagers
> apparently stuck - no more status checks an no response to the clustat
> command).

Now, the cluster node whose fencing resulted in a stuck system came up
and joined the cluster.

[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
dlm              1     clvmd    00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
[jmmpelto at pcn1 ~]$ sudo cman_tool status
Version: 6.0.1
Config Version: 40
Cluster Name: mappi-primary
Cluster Id: 11929
Cluster Member: Yes
Cluster Generation: 184
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: pcn1-hb
Node ID: 1
Multicast addresses: 239.192.46.199
Node addresses: 10.3.0.11

I killed the completely stuck pcn2-hb from there:

[jmmpelto at pcn1 ~]$ sudo cman_tool kill -n pcn2-hb

Log:

Jul  1 14:36:36 pcn2.mappi.helsinki.fi dlm_controld[4577]: cluster is down, exiting
Jul  1 14:36:36 pcn2.mappi.helsinki.fi gfs_controld[4583]: cluster is down, exiting
Jul  1 14:36:36 pcn2.mappi.helsinki.fi fenced[4571]: cluster is down, exiting
Jul  1 14:36:59 pcn2.mappi.helsinki.fi ccsd[4508]: Unable to connect to cluster infrastructure after 30 seconds.

Thereafter, node pcn3-hb fenced it, this time with log entries:

Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn2-hb not a cluster member after 0 sec post_fail_delay
Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn1-hb not a cluster member after 0 sec post_fail_delay
Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: fencing node "pcn2-hb"
Jul  1 14:38:08 pcn3.mappi.helsinki.fi fenced[4371]: fence "pcn2-hb" success
Jul  1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Attempt to close an unopened CCS descriptor (3012450).
Jul  1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Error while processing disconnect: Invalid request descriptor

But nobody tried to fence pcn1-hb (see the second log line). But apparently,
pcn3-hb tried to say something to pcn1-hb.

Jul  1 14:38:13 pcn1.mappi.helsinki.fi fenced[4461]: fencing deferred to prior member
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/id" error -1 2
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2

This time the services are in no specific state, but the rgmanager still does nothin constructive:

[jmmpelto at pcn3 ~]$ sudo cman_tool services
Password:
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
[jmmpelto at pcn3 ~]$ sudo clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  pcnm-hb                             100 Online
  pcn1-hb                               1 Online
  pcn2-hb                               2 Offline
  pcn3-hb                               3 Online, Local
  pcn4-hb                               4 Online

On node pcn1-hb:

[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
[jmmpelto at pcn1 ~]$
[jmmpelto at pcn1 ~]$
[jmmpelto at pcn1 ~]$ sudo clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  pcnm-hb                             100 Online
  pcn1-hb                               1 Online, Local
  pcn2-hb                               2 Offline
  pcn3-hb                               3 Online
  pcn4-hb                               4 Online

Er again.


--Janne



From janne.peltonen at helsinki.fi  Sun Jul  1 11:52:01 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 14:52:01 +0300
Subject: [Linux-cluster] Rgmanager fails to restart
In-Reply-To: <20070701114521.GC9103@helsinki.fi>
References: <20070701111748.GA9103@helsinki.fi>
	<20070701113040.GB9103@helsinki.fi>
	<20070701114521.GC9103@helsinki.fi>
Message-ID: <20070701115201.GD9103@helsinki.fi>

On Sun, Jul 01, 2007 at 02:45:21PM +0300, Janne Peltonen wrote:
> Er again.

At this point, I said 'cman_tool leave force' on pcn1-hb, which resulted
in pcn3-hb fencing it. This time the fencing was successful, and the
rgmanagers on remaining nodes woke up.

All kinds of mayhem... It's always a ten-minute break in services if a
service has to be relocated from a node to another (the services are a
bit slow to start) (or longer, if a node is really down and the
rgmanagers elsewhere are stuck).


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Sun Jul  1 12:12:39 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 15:12:39 +0300
Subject: [Linux-cluster] At which point is a service relocated?
Message-ID: <20070701121239.GE9103@helsinki.fi>

Hi!

It seems to me that if a node comes up that is higher in the prioritized
failover domain of a service than the node that was running it, the
service doesn't always relocate. Is there some documentation on this
somewhere?

It appears that the service relocates if the difference between the
priorities of the old and new node is at least two.

Is there a way to modify this?

Thanks.


--Janne



From David.Schroeder at flinders.edu.au  Sun Jul  1 23:37:15 2007
From: David.Schroeder at flinders.edu.au (David Schroeder)
Date: Mon, 02 Jul 2007 09:07:15 +0930
Subject: [Linux-cluster] IP monitor failing periodically
In-Reply-To: <4686A3BF.9070609@cmiware.com>
References: <4686A3BF.9070609@cmiware.com>
Message-ID: <46883AAB.8000801@flinders.edu.au>

Hi Chris,

I am experiencing the same problem on RHEL 5 and I have a support 
request in with RedHat.

I was asked to increase the debug level by changing the <rm> line in the 
cluster configuration to:

<rm log_facility="local4" log_level="7">

I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and 
run "service syslog restart".

To update the cluster configuration I needed to propagate the cluster 
configuration to both nodes:

# ccs_tool update /etc/cluster/cluster.conf

After a week I have not had the problem with the increased logging 
despite the problem occurring regularly prior to that - 2 to 3 times a 
day. One day last week out of curiosity I reverted to the default 
settings and within a few hours I had the failure to ping error on one 
of the clustered IP addresses and the service was restarted.

I now have the logging back at 7 and the support request is pending.

Regards
-- 
David Schroeder
Server Support
Information Services Division
Flinders University
Adelaide, Australia
Ph: +61 8 8201 2689


Chris Harms wrote:
> I am experiencing periodic failovers due to a floating IP address not 
> passing the status check:
> 
> clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
> Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip 
> "192.168.13.204" returned 1 (generic error)
> 
> Both nodes have bonded NICs with gigabit connections to redundant 
> switches, so it is unlikely they are going down, nothing in the logs 
> about linux losing the links.  I parked all the cluster services - 2 
> Postgres services and 1 Apache - on one node and allowed it to run 
> overnight.  There would be no client activity during this time. One 
> Postgres service failed two times in this manner and the other failed 
> once in this manner.  The Apache service did not fail.
> 
> What can I do to resolve this or get more information out of the system 
> to resolve this?
> 
> Thanks in advance,
> Chris
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From pastany at gmail.com  Mon Jul  2 07:23:35 2007
From: pastany at gmail.com (pastany)
Date: Mon, 2 Jul 2007 15:23:35 +0800
Subject: [Linux-cluster] SCSI Error 
Message-ID: <200706272056059379538@gmail.com>

I am running a 4 node cluster with a fc switch and a fujitsu fc san

but i recevie this message ,and some partions dont work

Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 668992848
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: fatal: I/O error
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0:   block = 83624058
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0:   function = gfs_dreread
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0:   file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 576
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0:   time = 1182943429
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636680
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: about to withdraw from the cluster
Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: waiting for outstanding I/O
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636688
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636696
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636704
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636712
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636720
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636728
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636736
Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636744
Jun 27 19:23:50 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636752
Jun 27 19:23:50 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636760
Jun 27 19:23:50 test1 kernel: GFS: fsid=tvod:vod.0: telling LM to withdraw
Jun 27 19:33:37 test1 kernel: lock_dlm: withdraw abandoned memory
Jun 27 19:33:37 test1 kernel: GFS: fsid=tvod:vod.0: withdraw




pastany
2007-06-27
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070702/0ea3e139/attachment.htm>

From rpeterso at redhat.com  Mon Jul  2 13:58:20 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 02 Jul 2007 08:58:20 -0500
Subject: [Linux-cluster] SCSI Error
In-Reply-To: <200706272056059379538@gmail.com>
References: <200706272056059379538@gmail.com>
Message-ID: <1183384700.11507.75.camel@technetium.msp.redhat.com>

On Mon, 2007-07-02 at 15:23 +0800, pastany wrote:
> I am running a 4 node cluster with a fc switch and a fujitsu fc san 
> but i recevie this message ,and some partions dont work
>  
> Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code =
> 0x10000
> Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector
> 668992848

Hi Pastany,

This sounds like a hardware problem to me, not a GFS problem.
It could be a bad drive, bad san or bad Host Bus Adapter (HBA).
Perhaps you should unmount the san from all the nodes, then
from one node, do a simple read test of the san:

1. Check dmesg and maybe clear your dmesg buffer:
   dmesg -c
2. Try reading every sector of the san:
   dd if=/dev/sdb of=/dev/null bs=1M
3. Check your console / dmesg to see if SCSI errors are reported.

You may want to try that separately on a few different nodes just
in case the error was caused by a bad HBA in the node that 
reported the problem.

Regards,

Bob Peterson
Red Hat Cluster Suite




From chris at cmiware.com  Mon Jul  2 15:26:59 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 02 Jul 2007 10:26:59 -0500
Subject: [Linux-cluster] issues starting services
Message-ID: <46891943.6010202@cmiware.com>

Had a nice little hardware failure over the weekend.  After having the 
machine come back on-line, 2 of the registered services didn't start 
(the fact that they weren't running already is a function of services 
not failing over until fencing succeeds).

Issuing a start operation in Conga did nothing. 

Issuing  clusvcadm -e [service] -m [node] yielded:
    Member [node] trying to enable service:[service]...Success
    service:[service] is now running on [node]
This was not the case.  Nothing happened.  Nothing was logged.

clusvcadm -R [service] seemed to be the magic bullet.  Is that the 
official way to recover a service?

Chris



From jwilson at transolutions.net  Mon Jul  2 20:10:49 2007
From: jwilson at transolutions.net (James Wilson)
Date: Mon, 02 Jul 2007 15:10:49 -0500
Subject: [Linux-cluster] failover not working
Message-ID: <46895BC9.5010101@transolutions.net>

Hey All,

    I was just wondering if someone could point out my errors? I 
currently have 3 servers in a cluster server1(dolphins), server2(lions), 
server3(patriots). server1 and server3 are being mirrored via DRBD. I 
have set up the cluster so that if server1 fails then server3 will take 
over. I have configured a vip to go between the 2 servers. I also do a 
gnbd_import on this vip from server2. The problem is when ever I pull 
the plug on server1 the vip never moves over to server3. here is a copy 
of my cluster.conf. Any help is appreciated.


<?xml version="1.0"?>
<cluster config_version="5" name="nas-cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="dolphins" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-fencing" 
nodename="dolphins"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="lions" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-gnbd-fencing" 
nodename="lions"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="server3" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-fencing" 
nodename="server2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_gnbd" name="nas-gnbd-fencing" 
servers="dolphins"/>
                <fencedevice agent="fence_manual" name="nas-fencing"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="dolphins-drbd1" ordered="1">
                                <failoverdomainnode name="dolphins" 
priority="1"/>
                                <failoverdomainnode name="patriots" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="dolphins-drbd2" 
ordered="1" restricted="0">
                                <failoverdomainnode name="dolphins" 
priority="2"/>
                                <failoverdomainnode name="patriots" 
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.5.4" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="dolphins-drbd1" 
name="dolphins-svc-drbd1" recovery="relocate">
                        <ip ref="192.168.5.4"/>
                </service>
                <service autostart="1" domain="dolphins-drbd2" 
name="dolphins-svc-drbd2" recovery="relocate">
                        <ip ref="192.168.5.4"/>
                </service>
        </rm>
        <fence_xvmd/>
        <fence_xvmd/>
</cluster>








From tomas.hoger at gmail.com  Tue Jul  3 11:19:44 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Tue, 3 Jul 2007 13:19:44 +0200
Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster
Message-ID: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com>

Hi!

I've come across a problem with two-node cluster on RHEL 4U3.  When I
attempt to reboot one of the nodes, it sometimes fails to leave
cluster correctly.

Before reboot, both nodes are cluster members and it is possible to
fail-over services from one node to another.  When I try to reboot
node1 (active at that time), services fail-over to node2, however,
cman fails to stop correctly:

cman: Stopping cman:
cman: failed to stop cman failed

node2 logs following message:

kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats

I see no information about fencing attempts in the log.


After node1's reboot, it is not able to rejoin cluster any more.

node1:

kernel: CMAN: Waiting to join or form a Linux-cluster
kernel: CMAN: sending membership request
kernel: CMAN: got node node2
cman: Timed-out waiting for cluster failed

While on node2:

kernel: CMAN: node node1 rejoining

and after ~4.5 minutes:

kernel: CMAN: too many transition restarts - will die
kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
kernel: WARNING: dlm_emergency_shutdown
clurgmgrd[2848]: <warning> #67: Shutting down uncleanly
kernel: WARNING: dlm_emergency_shutdown
kernel: SM: 00000001 sm_stop: SG still joined
kernel: SM: 01000003 sm_stop: SG still joined
kernel: SM: 03000002 sm_stop: SG still joined
ccsd[2242]: Cluster is not quorate.  Refusing connection.
ccsd[2242]: Error while processing connect: Connection refused
ccsd[2242]: Invalid descriptor specified (-111).
ccsd[2242]: Someone may be attempting something evil.
ccsd[2242]: Error while processing get: Invalid request descriptor
ccsd[2242]: Invalid descriptor specified (-111).
ccsd[2242]: Someone may be attempting something evil.
ccsd[2242]: Error while processing get: Invalid request descriptor
ccsd[2242]: Invalid descriptor specified (-21).

and again ~1 minute later on node1:

kernel: CMAN: removing node node2 from the cluster : No response to messages
kernel: ------------[ cut here ]------------
kernel: kernel BUG at
/usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
kernel: invalid operand: 0000 [#1]
kernel: SMP
kernel: Modules linked in: cman(U) md5 ipv6 iptable_filter ip_tables
button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy sg st mptspi
mptscsi mptbase dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss
sd_mod scsi_mod
kernel: CPU:    0
kernel: EIP:    0060:[<f8bebe2a>]    Not tainted VLI
kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp)
kernel: EIP is at elect_master+0x2e/0x3a [cman]
kernel: eax: 00000000   ebx: f7b4afa0   ecx: 00000080   edx: 00000080
kernel: esi: f8bff044   edi: f7b4afd8   ebp: 00000000   esp: f7b4af98
kernel: ds: 007b   es: 007b   ss: 0068
kernel: Process cman_memb (pid: 2429, threadinfo=f7b4a000 task=c1a33230)
kernel: Stack: f8bfef08 f8be98d1 c1a7c580 f6e8ee00 f8be7eb7 c1a33230
c1a33230 f8be809a
kernel:        0000001f 00000000 f7b460b0 00000000 c1a33230 c011e71b
00100100 00200200
kernel:        00000000 00000000 0000007b f8be7ed8 00000000 00000000
c01041f5 00000000
kernel: Call Trace:
kernel:  [<f8be98d1>] a_node_just_died+0x13a/0x199 [cman]
kernel:  [<f8be7eb7>] process_dead_nodes+0x4e/0x6f [cman]
kernel:  [<f8be809a>] membership_kthread+0x1c2/0x39d [cman]
kernel:  [<c011e71b>] default_wake_function+0x0/0xc
kernel:  [<f8be7ed8>] membership_kthread+0x0/0x39d [cman]
kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
kernel: Code: 28 fe bf f8 89 c3 ba 01 00 00 00 39 ca 7d 1c a1 2c fe bf
f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89 03 8b 40 14 eb 0d 42 eb
e0 <0f> 0b 4e 0c 73 2d bf f8 31 c0 5b c3 a1 2c fe bf f8 e8 79 70 56
kernel:  <0>Fatal exception: panic in 5 seconds


During one other test, cluster did not crash, it just ended in the
state, when cman on rebooted node kept sending cluster membership
requests and those requests were ignored by other cluster node.
Output of tcpdump showed traffic was reaching active node, but there
was no reply nor any message in the logs of active node.  Only way to
get to normal state is to restart cman on active node (or reboot both
nodes).


If I try to reboot one of cluster nodes shortly after rebooting both
nodes, it seems to leave and rejoin cluster successfully.

Has anyone observed similar behavior?  Is this known bug in U3, which
can be resolved by upgrade to latest version?  I've checked changelogs
and release notes (Btw, any chance to get back to "old" release notes
format for RHCS?  Release notes for U5 do not longer list fixed
bugzilla reports, only links some errata listings, which do not seem
to be accessible from Internet.), but haven't found any obvious
reference to this king of problem.

Ideas appreciated.

th.



From jbrassow at redhat.com  Tue Jul  3 20:59:59 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Tue, 3 Jul 2007 15:59:59 -0500
Subject: [Linux-cluster] failover not working
In-Reply-To: <46895BC9.5010101@transolutions.net>
References: <46895BC9.5010101@transolutions.net>
Message-ID: <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com>


On Jul 2, 2007, at 3:10 PM, James Wilson wrote:

>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="dolphins-drbd1"  
> ordered="1">
>                                <failoverdomainnode name="dolphins"  
> priority="1"/>
>                                <failoverdomainnode name="patriots"  
> priority="2"/>
>                        </failoverdomain>
>                        <failoverdomain name="dolphins-drbd2"  
> ordered="1" restricted="0">
>                                <failoverdomainnode name="dolphins"  
> priority="2"/>
>                                <failoverdomainnode name="patriots"  
> priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <ip address="192.168.5.4" monitor_link="1"/>
>                </resources>
>                <service autostart="1" domain="dolphins-drbd1"  
> name="dolphins-svc-drbd1" recovery="relocate">
>                        <ip ref="192.168.5.4"/>
>                </service>
>                <service autostart="1" domain="dolphins-drbd2"  
> name="dolphins-svc-drbd2" recovery="relocate">
>                        <ip ref="192.168.5.4"/>
>                </service>
>        </rm>

That section looks funny.  You don't need the two 'failoverdomain's;  
and you don't need the two services.  Does rgmanager even startup?   
Check /var/log/messages for more info.

  brassow



From jwilson at transolutions.net  Tue Jul  3 21:08:11 2007
From: jwilson at transolutions.net (James Wilson)
Date: Tue, 03 Jul 2007 16:08:11 -0500
Subject: [Linux-cluster] failover not working
In-Reply-To: <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com>
References: <46895BC9.5010101@transolutions.net>
	<531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com>
Message-ID: <468ABABB.6080104@transolutions.net>

I have changed it since I posted this. But I though I needed the 2 
failover domains? One for each host so if dolphins fails it failsover to 
patriots and vice versa. Or do I just need one because of the virtual IP?


                <failoverdomains>
                        <failoverdomain name="dolphins-drbd1" ordered="1">
                                <failoverdomainnode name="dolphins" 
priority="1"/>
                                <failoverdomainnode name="patriots" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="dolphins-drbd2" 
ordered="1" restricted="0">
                                <failoverdomainnode name="dolphins" 
priority="2"/>
                                <failoverdomainnode name="patriots" 
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.5.4" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="dolphins-drbd1" 
name="dolphins-svc-drbd1" recovery="relocate">
                        <ip ref="192.168.5.4"/>
                </service>
                <service autostart="1" domain="dolphins-drbd2" 
name="dolphins-svc-drbd2" recovery="relocate">
                        <ip ref="192.168.5.4"/>
                </service>
        </rm>
        <fence_xvmd/>
        <fence_xvmd/>


Jonathan Brassow wrote:
>
> On Jul 2, 2007, at 3:10 PM, James Wilson wrote:
>
>>
>>        <rm>
>>                <failoverdomains>
>>                        <failoverdomain name="dolphins-drbd1" 
>> ordered="1">
>>                                <failoverdomainnode name="dolphins" 
>> priority="1"/>
>>                                <failoverdomainnode name="patriots" 
>> priority="2"/>
>>                        </failoverdomain>
>>                        <failoverdomain name="dolphins-drbd2" 
>> ordered="1" restricted="0">
>>                                <failoverdomainnode name="dolphins" 
>> priority="2"/>
>>                                <failoverdomainnode name="patriots" 
>> priority="1"/>
>>                        </failoverdomain>
>>                </failoverdomains>
>>                <resources>
>>                        <ip address="192.168.5.4" monitor_link="1"/>
>>                </resources>
>>                <service autostart="1" domain="dolphins-drbd1" 
>> name="dolphins-svc-drbd1" recovery="relocate">
>>                        <ip ref="192.168.5.4"/>
>>                </service>
>>                <service autostart="1" domain="dolphins-drbd2" 
>> name="dolphins-svc-drbd2" recovery="relocate">
>>                        <ip ref="192.168.5.4"/>
>>                </service>
>>        </rm>
>
> That section looks funny.  You don't need the two 'failoverdomain's; 
> and you don't need the two services.  Does rgmanager even startup?  
> Check /var/log/messages for more info.
>
>  brassow
>



From chris at cmiware.com  Tue Jul  3 23:20:23 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 03 Jul 2007 18:20:23 -0500
Subject: [Linux-cluster] dual fence redux
Message-ID: <468AD9B7.4060100@cmiware.com>

To recap:
I am attempting to setup a 2 node cluster where each will run a DB and 
an apache service to be failed over between them.  Both are fenced via 
Dell DRAC connected via the system NICs (this adds to the issue, but 
manual fencing is broken).

My test case so far is to unplug the network cables from one node and 
then reconnect them.  For some reason, both machines get halted instead 
of one machine being fenced.  Having only one node fenced in this 
scenario has only occurred successfully one time.

I previously suspected DRBD as being the culprit, but I can now rule 
this out after performing the cable pull test without RHCS running, and 
having DRBD in every possible configuration the cluster could put it in, 
including a split brain (which is impossible for me due to services not 
failing over until fencing occurs).

Is there any component of the cluster system that would issue the 
shutdown command shown in the log entry below?

[From logs on Node A]

Jul  3 17:36:20 nodeA openais[3504]: [MAIN ] Killing node nodeB because 
it has rejoined the cluster without cman_tool join
Jul  3 17:36:20 nodeA kernel: drbd0: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jul  3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now.
Jul  3 17:36:21 nodeA kernel: drbd0: conn( WFBitMapS -> SyncSource ) 
pdsk( UpToDate -> Inconsistent )
Jul  3 17:36:21 nodeA kernel: drbd0: Began resync as SyncSource (will 
sync 56 KB [14 bits set]).
Jul  3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now.
Jul  3 17:36:21 nodeA kernel: drbd0: Resync done (total 1 sec; paused 0 
sec; 56 K/sec)
Jul  3 17:36:21 nodeA kernel: drbd0: conn( SyncSource -> Connected ) 
pdsk( Inconsistent -> UpToDate )
Jul  3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now.
Jul  3 17:36:21 nodeA shutdown[18845]: shutting down for system halt


Thanks to a hardware issue on NodeB, I am unable to get to the logs off 
of it presently.





From bsd_daemon at msn.com  Wed Jul  4 08:52:37 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 04 Jul 2007 08:52:37 +0000
Subject: [Linux-cluster] failover not working
In-Reply-To: <468ABABB.6080104@transolutions.net>
Message-ID: <BAY105-F39D243057C65747DD88DF2E3030@phx.gbl>


You should use Linux H.A. (heartbeat) for failover. the rgmanager very good 
but not yet. you not must use failover with rgmanager. you can get some 
problems because of priority and two failover happened some the wrong.

have a nice day..

>From: James Wilson <jwilson at transolutions.net>
>Reply-To: jwilson at transolutions.net,linux clustering 
><linux-cluster at redhat.com>
>To: Jonathan Brassow <jbrassow at redhat.com>
>CC: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] failover not working
>Date: Tue, 03 Jul 2007 16:08:11 -0500
>
>I have changed it since I posted this. But I though I needed the 2 failover 
>domains? One for each host so if dolphins fails it failsover to patriots 
>and vice versa. Or do I just need one because of the virtual IP?
>
>
>                <failoverdomains>
>                        <failoverdomain name="dolphins-drbd1" ordered="1">
>                                <failoverdomainnode name="dolphins" 
>priority="1"/>
>                                <failoverdomainnode name="patriots" 
>priority="2"/>
>                        </failoverdomain>
>                        <failoverdomain name="dolphins-drbd2" ordered="1" 
>restricted="0">
>                                <failoverdomainnode name="dolphins" 
>priority="2"/>
>                                <failoverdomainnode name="patriots" 
>priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <ip address="192.168.5.4" monitor_link="1"/>
>                </resources>
>                <service autostart="1" domain="dolphins-drbd1" 
>name="dolphins-svc-drbd1" recovery="relocate">
>                        <ip ref="192.168.5.4"/>
>                </service>
>                <service autostart="1" domain="dolphins-drbd2" 
>name="dolphins-svc-drbd2" recovery="relocate">
>                        <ip ref="192.168.5.4"/>
>                </service>
>        </rm>
>        <fence_xvmd/>
>        <fence_xvmd/>
>
>
>Jonathan Brassow wrote:
>>
>>On Jul 2, 2007, at 3:10 PM, James Wilson wrote:
>>
>>>
>>>        <rm>
>>>                <failoverdomains>
>>>                        <failoverdomain name="dolphins-drbd1" 
>>>ordered="1">
>>>                                <failoverdomainnode name="dolphins" 
>>>priority="1"/>
>>>                                <failoverdomainnode name="patriots" 
>>>priority="2"/>
>>>                        </failoverdomain>
>>>                        <failoverdomain name="dolphins-drbd2" ordered="1" 
>>>restricted="0">
>>>                                <failoverdomainnode name="dolphins" 
>>>priority="2"/>
>>>                                <failoverdomainnode name="patriots" 
>>>priority="1"/>
>>>                        </failoverdomain>
>>>                </failoverdomains>
>>>                <resources>
>>>                        <ip address="192.168.5.4" monitor_link="1"/>
>>>                </resources>
>>>                <service autostart="1" domain="dolphins-drbd1" 
>>>name="dolphins-svc-drbd1" recovery="relocate">
>>>                        <ip ref="192.168.5.4"/>
>>>                </service>
>>>                <service autostart="1" domain="dolphins-drbd2" 
>>>name="dolphins-svc-drbd2" recovery="relocate">
>>>                        <ip ref="192.168.5.4"/>
>>>                </service>
>>>        </rm>
>>
>>That section looks funny.  You don't need the two 'failoverdomain's; and 
>>you don't need the two services.  Does rgmanager even startup?  Check 
>>/var/log/messages for more info.
>>
>>  brassow
>>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507



From Donald.Deasy at sli-institute.ac.uk  Wed Jul  4 10:01:49 2007
From: Donald.Deasy at sli-institute.ac.uk (Donald Deasy)
Date: Wed, 4 Jul 2007 11:01:49 +0100
Subject: [Linux-cluster] Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2
	?
Message-ID: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk>

Just tried compiling cluster on 2.6.9-55.0.2 from the RHEL45 CVS
repository.

Got a few errors which I fixed by creating links to the
build/incdir/cluster directory, which may be the problem !

Cannot figure out why it's failing at gfs/gfs_edit 
make[4]: Entering directory
`/root/sources.redhat.com/RHEL45/cluster/gfs/gfs_edit'
gcc -Wall -I../include -I../config
-I/root/sources.redhat.com/RHEL45/cluster/build/incdir -DHELPER_PROGRAM
-D_FILE_OFFSET_BITS=64 -DGFS_RELEASE_NAME=\"DEVEL.1183478317\"
-I../include -I../config
-I/root/sources.redhat.com/RHEL45/cluster/build/incdir  gfshex.c
hexedit.c   -lncurses -o gfs_edit
In file included from gfshex.c:29:
/root/sources.redhat.com/RHEL45/cluster/build/incdir/linux/gfs_ondisk.h:
626: error: syntax error before "__be64"

Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ?





From bsd_daemon at msn.com  Wed Jul  4 09:45:44 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 04 Jul 2007 09:45:44 +0000
Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster
In-Reply-To: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com>
Message-ID: <BAY105-F160472C1F9127656833AE5E3030@phx.gbl>


hi tomas,

when you do restart. which services run on the node1 ???

>node2 logs following message:
>
>kernel: CMAN: removing node node1 from the cluster : Missed too many 
>heartbeats

when network problem, you get this error.

>kernel: CMAN: too many transition restarts - will die
>kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
>kernel: WARNING: dlm_emergency_shutdown
>clurgmgrd[2848]: <warning> #67: Shutting down uncleanly
>kernel: WARNING: dlm_emergency_shutdown
>kernel: SM: 00000001 sm_stop: SG still joined
>kernel: SM: 01000003 sm_stop: SG still joined
>kernel: SM: 03000002 sm_stop: SG still joined
>ccsd[2242]: Cluster is not quorate.  Refusing connection.
>ccsd[2242]: Error while processing connect: Connection refused
>ccsd[2242]: Invalid descriptor specified (-111).
>ccsd[2242]: Someone may be attempting something evil.
>ccsd[2242]: Error while processing get: Invalid request descriptor
>ccsd[2242]: Invalid descriptor specified (-111).
>ccsd[2242]: Someone may be attempting something evil.
>ccsd[2242]: Error while processing get: Invalid request descriptor
>ccsd[2242]: Invalid descriptor specified (-21).
>
>and again ~1 minute later on node1:
>
>kernel: CMAN: removing node node2 from the cluster : No response to 
>messages
>kernel: ------------[ cut here ]------------
>kernel: kernel BUG at
>/usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
>kernel: invalid operand: 0000 [#1]
>kernel: SMP

i thing this error a bug. did you check this error from bugzilla ?

_________________________________________________________________
Local listings, incredible imagery, and driving directions - all in one 
place! http://maps.live.com/?wip=69&FORM=MGAC01



From manjusc13 at rediffmail.com  Wed Jul  4 11:35:31 2007
From: manjusc13 at rediffmail.com (manjunath c shanubog)
Date: 4 Jul 2007 11:35:31 -0000
Subject: [Linux-cluster] Mysql installation on Cluster
Message-ID: <20070704113531.518.qmail@webmail81.rediffmail.com>

Hi,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I need complete installation guide for installing cluster using redhat EL 5, and Mysql installtion guide on the&nbsp;cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;which fencing device is better whether APC 9120&nbsp;or NPS 230 from Western telematic.Thanking YouManjunath&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070704/512f1a8d/attachment.htm>

From wcheng at redhat.com  Wed Jul  4 15:19:11 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 04 Jul 2007 11:19:11 -0400
Subject: [Linux-cluster] Is cluster 1 RHEL45 correct branch for
	2.6.9-55.0.2 ?
In-Reply-To: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk>
References: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk>
Message-ID: <468BBA6F.1000702@redhat.com>

Donald Deasy wrote:

>Just tried compiling cluster on 2.6.9-55.0.2 from the RHEL45 CVS
>repository.
>
>Got a few errors which I fixed by creating links to the
>build/incdir/cluster directory, which may be the problem !
>
>Cannot figure out why it's failing at gfs/gfs_edit 
>make[4]: Entering directory
>`/root/sources.redhat.com/RHEL45/cluster/gfs/gfs_edit'
>gcc -Wall -I../include -I../config
>-I/root/sources.redhat.com/RHEL45/cluster/build/incdir -DHELPER_PROGRAM
>-D_FILE_OFFSET_BITS=64 -DGFS_RELEASE_NAME=\"DEVEL.1183478317\"
>-I../include -I../config
>-I/root/sources.redhat.com/RHEL45/cluster/build/incdir  gfshex.c
>hexedit.c   -lncurses -o gfs_edit
>In file included from gfshex.c:29:
>/root/sources.redhat.com/RHEL45/cluster/build/incdir/linux/gfs_ondisk.h:
>626: error: syntax error before "__be64"
>
>Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ?
>
>  
>
Sorry, it was my oversight. The problem is fixed in RHEL4 (queued for 
4.6) branch but not RHEL45. Will have to discuss with our PM to see what 
needs to be done to get the changes added into RHEL45 branch. In the 
mean time, please take the patch from:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=239523#c2

Let me know if you have more issues.

-- Wendy



From tomas.hoger at gmail.com  Wed Jul  4 16:36:46 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Wed, 4 Jul 2007 18:36:46 +0200
Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster
In-Reply-To: <BAY105-F160472C1F9127656833AE5E3030@phx.gbl>
References: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com>
	<BAY105-F160472C1F9127656833AE5E3030@phx.gbl>
Message-ID: <6cfbd1b40707040936y6d60e3dbt5c6c03b171d0e523@mail.gmail.com>

On 7/4/07, mehmet celik <bsd_daemon at msn.com> wrote:
> when you do restart. which services run on the node1 ???

Cluster only use one service (consisting of IP, filesystems and
applications) and it was running on node1 before it was rebooted.
Logs also show that service was moved to node2 during node1's
shutdown.

> when network problem, you get this error.

We haven't noticed any network-related problems.

th.



From bsd_daemon at msn.com  Thu Jul  5 07:20:07 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Thu, 05 Jul 2007 07:20:07 +0000
Subject: [Linux-cluster] Mysql installation on Cluster
In-Reply-To: <20070704113531.518.qmail@webmail81.rediffmail.com>
Message-ID: <BAY105-F961BFB227F2ECB59BD390E3020@phx.gbl>


hii manjunath,

how will you work the mysql cluster ? I know two way for the mysql-cluster.

1. active-passive (failover)
2. active-active

you don't active-active, because it's not be with RHCS. for this, you have 
to visit mysql.com


>From: "manjunath c shanubog" <manjusc13 at rediffmail.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: <linux-cluster at redhat.com>
>Subject: [Linux-cluster] Mysql installation on Cluster
>Date: 4 Jul 2007 11:35:31 -0000
>
>Hi,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I need complete installation guide for 
>installing cluster using redhat EL 5, and Mysql installtion guide on 
>the&nbsp;cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;which fencing device 
>is better whether APC 9120&nbsp;or NPS 230 from Western telematic.Thanking 
>YouManjunath&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://liveearth.msn.com



From bsd_daemon at msn.com  Thu Jul  5 07:55:45 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Thu, 05 Jul 2007 07:55:45 +0000
Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster
In-Reply-To: <6cfbd1b40707040936y6d60e3dbt5c6c03b171d0e523@mail.gmail.com>
Message-ID: <BAY105-F14F093D7453044B9EF2EDFE3020@phx.gbl>


hii Tomas,

"missed too ... heartbeat ..." this error is generally network and 
comminucation problems. You should tcpdump for the event. you using tcpdump, 
find source of this error.

example cman to cman communication,

16:40:13.540514 IP node1.domain.com.6809 > 10.0.0.255.6809: UDP, length 28
16:40:14.037110 IP node2.domain.com.6809 > 10.0.0.255.6809: UDP, length 28
16:40:15.059749 IP node3.domain.com.6809 > 10.0.0.255.6809: UDP, length 28
16:41:28.568924 IP node1.domain.com.6809 > 10.0.0.255.6809: UDP, length 28
16:41:29.016120 IP node2.domain.com.6809 > 10.0.0.255.6809: UDP, length 28
16:41:30.046889 IP node3.domain.com.6809 > 10.0.0.255.6809: UDP, length 28


>From: "Tomas Hoger" <tomas.hoger at gmail.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster
>Date: Wed, 4 Jul 2007 18:36:46 +0200
>
>On 7/4/07, mehmet celik <bsd_daemon at msn.com> wrote:
>>when you do restart. which services run on the node1 ???
>
>Cluster only use one service (consisting of IP, filesystems and
>applications) and it was running on node1 before it was rebooted.
>Logs also show that service was moved to node2 during node1's
>shutdown.
>
>>when network problem, you get this error.
>
>We haven't noticed any network-related problems.
>
>th.
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://liveearth.msn.com



From teigland at redhat.com  Thu Jul  5 18:14:39 2007
From: teigland at redhat.com (David Teigland)
Date: Thu, 5 Jul 2007 13:14:39 -0500
Subject: [Linux-cluster] dual fence redux
In-Reply-To: <468AD9B7.4060100@cmiware.com>
References: <468AD9B7.4060100@cmiware.com>
Message-ID: <20070705181439.GA23666@redhat.com>

On Tue, Jul 03, 2007 at 06:20:23PM -0500, Chris Harms wrote:
> To recap:
> I am attempting to setup a 2 node cluster where each will run a DB and 
> an apache service to be failed over between them.  Both are fenced via 
> Dell DRAC connected via the system NICs (this adds to the issue, but 
> manual fencing is broken).
> 
> My test case so far is to unplug the network cables from one node and 
> then reconnect them.  For some reason, both machines get halted instead 
> of one machine being fenced.  Having only one node fenced in this 
> scenario has only occurred successfully one time.
> 
> I previously suspected DRBD as being the culprit, but I can now rule 
> this out after performing the cable pull test without RHCS running, and 
> having DRBD in every possible configuration the cluster could put it in, 
> including a split brain (which is impossible for me due to services not 
> failing over until fencing occurs).
> 
> Is there any component of the cluster system that would issue the 
> shutdown command shown in the log entry below?

Perhaps qdisk?

Dave



From bmarzins at redhat.com  Thu Jul  5 20:01:28 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 5 Jul 2007 15:01:28 -0500
Subject: [linux-cluster] multipath issue... Smells of hardware issue.
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk>
Message-ID: <20070705200128.GC27466@ether.msp.redhat.com>

On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote:
>    Hi,
> 
>    I have a setup with two identical RX200s3 FuSi servers talking to a SAN
>    (SX60 + extra controller), and that works fine with gfs1.
> 
>    I do however see some errors on one of the servers. It's in my message log
>    and only now and then now and then (though always under load, but i cant
>    load it and thereby force it to give the error).
> 
>    The error says:
>    Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
>    Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector
>    705160231
>    Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
>    Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
>    Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
>    Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector
>    739870727
>    Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32.
>    Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is
>    up
>    Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
>    Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    To me i looks like a fiber that bounces up and down. (There is no switch
>    involved).
> 
>    Sometimes i only get a slightly shorter version:
>    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector
>    2782490295
>    Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
>    Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
>    Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    Any sugestions, but start swapping hardware?

It's possible that your scsi device is timing out the scsi read command from the
readsector0 path checker, which is what it appears that your setup is using to
check the path status.  This checker has it's timeout set to 5 minutes, but I
suppose that it is possible to take this long if your hardware is a flaky. If
you're willing to recompile the code, you can change this default by changing
DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the scsi command timeout
in milliseconds.

Otherwise, if you are only seeing this on one server, swapping hardware seems
like a reasonable thing to try.

-Ben
 
>    Mvh / Kind regards
> 
>    Kristoffer Lippert
>    Systemansvarlig
>    JP/Politiken A/S
>    Online Magasiner
> 
>    Tlf. +45 8738 3032
>    Cell. +45 6062 8703

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Fri Jul  6 07:51:24 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 6 Jul 2007 09:51:24 +0200
Subject: SV: [linux-cluster] multipath issue... Smells of hardware issue.
In-Reply-To: <20070705200128.GC27466@ether.msp.redhat.com>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk>
	<20070705200128.GC27466@ether.msp.redhat.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A55@exchsrv07.rootdom.dk>

Hi,

Thank you very  much for the explaination.

The hardware should under no circumstances take 5 minutes to perform a readsector. Not even when the command queue is very long.
I've tried copying files to and from the SAN, and i've tried a little program called sys_basher working the disks continously since last Friday. (almost a week) and i have not been able to reproduce the error. Before i could produce it within an hour by copying files.
I've only seen the error on one server, and i've changed nothing. (well, obvouisly something must have changed since the error seems to be gone.) 

I get a throughput of about 120mb/sec on the san using GFS1. It's fast enough for my use (wich is large files for a website). Is it far below expected throughput? 

Kind regards
Kristoffer




-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Benjamin Marzinski
Sendt: 5. juli 2007 22:01
Til: linux clustering
Emne: Re: [linux-cluster] multipath issue... Smells of hardware issue.

On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote:
>    Hi,
> 
>    I have a setup with two identical RX200s3 FuSi servers talking to a SAN
>    (SX60 + extra controller), and that works fine with gfs1.
> 
>    I do however see some errors on one of the servers. It's in my message log
>    and only now and then now and then (though always under load, but i cant
>    load it and thereby force it to give the error).
> 
>    The error says:
>    Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
>    Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector
>    705160231
>    Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
>    Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
>    Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
>    Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector
>    739870727
>    Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32.
>    Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is
>    up
>    Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
>    Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    To me i looks like a fiber that bounces up and down. (There is no switch
>    involved).
> 
>    Sometimes i only get a slightly shorter version:
>    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector
>    2782490295
>    Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
>    Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
>    Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    Any sugestions, but start swapping hardware?

It's possible that your scsi device is timing out the scsi read command from the readsector0 path checker, which is what it appears that your setup is using to check the path status.  This checker has it's timeout set to 5 minutes, but I suppose that it is possible to take this long if your hardware is a flaky. If you're willing to recompile the code, you can change this default by changing DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the scsi command timeout in milliseconds.

Otherwise, if you are only seeing this on one server, swapping hardware seems like a reasonable thing to try.

-Ben
 
>    Mvh / Kind regards
> 
>    Kristoffer Lippert
>    Systemansvarlig
>    JP/Politiken A/S
>    Online Magasiner
> 
>    Tlf. +45 8738 3032
>    Cell. +45 6062 8703

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From manjusc13 at rediffmail.com  Fri Jul  6 09:58:44 2007
From: manjusc13 at rediffmail.com (manjunath c shanubog)
Date: 6 Jul 2007 09:58:44 -0000
Subject: [Linux-cluster] Mysql installation on Cluster
Message-ID: <1183621964.S.4215.12946.webmail51.rediffmail.com.old.1183715924.5775@webmail.rediffmail.com>

Hi Mehmet,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Thanx for ur reply !&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If it is possible i will go for active-active, otherwise active-passive.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Can u send me the steps to follow the cluster installation with fencing device and Mysql cluster installtion on the&nbsp;cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Which fencing device is better ?Thanking YOuManjunath SC&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; On Thu, 05 Jul 2007 07:20:07 +0000 linux clustering wrotehii manjunath,how will you work the mysql cluster ? I know two way for the mysql-cluster.1. active-passive (failover)2. active-activeyou don\'t active-active, because it\'s not be with RHCS. for this, you have to visit mysql.com&gt;From: \"manjunath c shanubog\" &gt;Reply-To: linux clustering &gt;To: &gt;Subject: [Linux-cluster] Mysql installation on Cluster&gt;Date: 4 Jul 2!
 007 11:35:31 -0000&gt;&gt;Hi,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I need complete installation guide for &gt;installing cluster using redhat EL 5, and Mysql installtion guide on &gt;the&nbsp;cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;which fencing device &gt;is better whether APC 9120&nbsp;or NPS 230 from Western telematic.Thanking &gt;YouManjunath&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&gt;--&gt;Linux-cluster mailing list&gt;Linux-cluster at redhat.com&gt;https://www.redhat.com/mailman/listinfo/linux-cluster_________________________________________________________________http://liveearth.msn.com--Linux-cluster mailing listLinux-cluster at redhat.comhttps://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070706/a21de1ea/attachment.htm>

From suvankar_moitra at yahoo.com  Fri Jul  6 09:32:40 2007
From: suvankar_moitra at yahoo.com (SUVANKAR MOITRA)
Date: Fri, 6 Jul 2007 02:32:40 -0700 (PDT)
Subject: SV: [linux-cluster] multipath issue... Smells of hardware issue.
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A55@exchsrv07.rootdom.dk>
Message-ID: <328099.69094.qm@web52002.mail.re2.yahoo.com>

hi ,

Pl install the device driver in failover mode.

regards

Suvankar
--- Kristoffer Lippert <kristoffer.lippert at jppol.dk>
wrote:

> Hi,
> 
> Thank you very  much for the explaination.
> 
> The hardware should under no circumstances take 5
> minutes to perform a readsector. Not even when the
> command queue is very long.
> I've tried copying files to and from the SAN, and
> i've tried a little program called sys_basher
> working the disks continously since last Friday.
> (almost a week) and i have not been able to
> reproduce the error. Before i could produce it
> within an hour by copying files.
> I've only seen the error on one server, and i've
> changed nothing. (well, obvouisly something must
> have changed since the error seems to be gone.) 
> 
> I get a throughput of about 120mb/sec on the san
> using GFS1. It's fast enough for my use (wich is
> large files for a website). Is it far below expected
> throughput? 
> 
> Kind regards
> Kristoffer
> 
> 
> 
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] P? vegne
> af Benjamin Marzinski
> Sendt: 5. juli 2007 22:01
> Til: linux clustering
> Emne: Re: [linux-cluster] multipath issue... Smells
> of hardware issue.
> 
> On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer
> Lippert wrote:
> >    Hi,
> > 
> >    I have a setup with two identical RX200s3 FuSi
> servers talking to a SAN
> >    (SX60 + extra controller), and that works fine
> with gfs1.
> > 
> >    I do however see some errors on one of the
> servers. It's in my message log
> >    and only now and then now and then (though
> always under load, but i cant
> >    load it and thereby force it to give the
> error).
> > 
> >    The error says:
> >    Jun 28 15:44:17 app02 multipathd: 8:16: mark as
> failed
> >    Jun 28 15:44:17 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 28 15:44:17 app02 kernel: end_request: I/O
> error, dev sdb, sector
> >    705160231
> >    Jun 28 15:44:17 app02 kernel: device-mapper:
> multipath: Failing path 8:16.
> >    Jun 28 15:44:22 app02 multipathd: sdb:
> readsector0 checker reports path is
> >    up
> >    Jun 28 15:44:22 app02 multipathd: 8:16:
> reinstated
> >    Jun 28 15:44:22 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> >    Jun 28 15:46:02 app02 multipathd: 8:32: mark as
> failed
> >    Jun 28 15:46:02 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 28 15:46:02 app02 kernel: end_request: I/O
> error, dev sdc, sector
> >    739870727
> >    Jun 28 15:46:02 app02 kernel: device-mapper:
> multipath: Failing path 8:32.
> >    Jun 28 15:46:06 app02 multipathd: sdc:
> readsector0 checker reports path is
> >    up
> >    Jun 28 15:46:06 app02 multipathd: 8:32:
> reinstated
> >    Jun 28 15:46:06 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> > 
> >    To me i looks like a fiber that bounces up and
> down. (There is no switch
> >    involved).
> > 
> >    Sometimes i only get a slightly shorter
> version:
> >    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 29 09:04:32 app02 kernel: end_request: I/O
> error, dev sdb, sector
> >    2782490295
> >    Jun 29 09:04:32 app02 kernel: device-mapper:
> multipath: Failing path 8:16.
> >    Jun 29 09:04:32 app02 multipathd: 8:16: mark as
> failed
> >    Jun 29 09:04:32 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 29 09:04:37 app02 multipathd: sdb:
> readsector0 checker reports path is
> >    up
> >    Jun 29 09:04:37 app02 multipathd: 8:16:
> reinstated
> >    Jun 29 09:04:37 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> > 
> >    Any sugestions, but start swapping hardware?
> 
> It's possible that your scsi device is timing out
> the scsi read command from the readsector0 path
> checker, which is what it appears that your setup is
> using to check the path status.  This checker has
> it's timeout set to 5 minutes, but I suppose that it
> is possible to take this long if your hardware is a
> flaky. If you're willing to recompile the code, you
> can change this default by changing DEF_TIMEOUT in
> libcheckers/checkers.h. DEF_TIMEOUT is the scsi
> command timeout in milliseconds.
> 
> Otherwise, if you are only seeing this on one
> server, swapping hardware seems like a reasonable
> thing to try.
> 
> -Ben
>  
> >    Mvh / Kind regards
> > 
> >    Kristoffer Lippert
> >    Systemansvarlig
> >    JP/Politiken A/S
> >    Online Magasiner
> > 
> >    Tlf. +45 8738 3032
> >    Cell. +45 6062 8703
> 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> >
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 



      ____________________________________________________________________________________
Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/ 



From dan.deshayes at algitech.com  Fri Jul  6 14:36:15 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Fri, 06 Jul 2007 16:36:15 +0200
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC761934826F4@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC761934826F4@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <468E535F.2080606@algitech.com>

Hello,
I'm bumping this question since I'm experienceing a smiliar problem.
When one of my services fails and the cluster is trying to restart it, 
the node withdraws the ip and route.
It seems that it can't setup the ip again when it has withdrawn. It can 
failover between nodes which holds
other ipnumbers though, but never back except when I manully puts back 
the ip and route.
I don't want to relocate the service just if sms-pixie fails but only to 
restart it (its stops when it looses connection to a server).
I'm using bond and my configuration looks like this:

                <resources>
                        <script file="/etc/init.d/mysqld" name="mysqld-db"/>
                        <script file="/etc/init.d/postgresql" 
name="psql-db"/>
                        <script file="/etc/init.d/sms-pixe" 
name="sms-pixie"/>
                        <ip address="<ip1>" interface="bond0" 
monitor_link="1"/>
                        <ip address="<ip2>" interface="bond0" 
monitor_link="1"/>
                </resources>
                <service autostart="1" domain="www-project1" 
name="www-project1" recovery="restart">
                        <ip ref="<ip1>"/>
                        <script ref="psql-db"/>
                </service>
                <service autostart="1" domain="www-project2" 
name="www-project2" recovery="restart">
                        <ip ref="<ip2>"/>
                        <script ref="mysqld-db"/>
                </service>

Any thoughts would be appriciated.

Regards, Dan


Robert Gil wrote:

> I have an IP address as a resource. I have the ip address in a 2 node 
> failover domain (total 4 nodes).
>  
> When i run ifconfig eth0:1 down
>  
> The service shows as stopped in clustat and the following errors show 
> in the logs
>  
> Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #71: Relocating 
> failed service mastervip
> Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #70: Attempting to 
> restart service mastervip locally.
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Recovering failed 
> service mastervip
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> start on 
> ip:192.168.2.100 returned 1 (generic error)
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <warning> #68: Failed to start 
> mastervip; return value: 1
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Stopping service 
> mastervip
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Service mastervip is 
> stopped
>  
> The following is the resources in /etc/cluster.conf
>  
>                 <resources>
>                         <clusterfs device="/dev/mapper/mqdata-mqdata" 
> force_unmount="0" fsid="22567" fstype="gfs" mountpoint="/mqdata" 
> name="mqdata" options=""/>
>                         <ip address="192.168.2.100" interface="eth0" 
> monitor_link="1"/>
>                 </resources>
>  
> The service in /etc/cluster.conf
>  
>                 <service autostart="1" domain="mysql" exclusive="1" 
> name="mastervip" recovery="relocate">
>                         <ip ref="192.168.2.100"/>
>                 </service>
> Any ideas?
>  
> Thanks,
>
>  
> *
> *Robert Gil*
> *Linux Systems Administrator*
> *American Home Mortgage*
> *Phone: 631-622-8410*
> Cell: 631-827-5775
> *Fax: 516-495-5861*
> *
>  
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>



From mwill at penguincomputing.com  Fri Jul  6 15:54:19 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Fri, 6 Jul 2007 08:54:19 -0700
Subject: SV: [linux-cluster] multipath issue... Smells of hardware issue.
Message-ID: <433093DF7AD7444DA65EFAFE3987879C33DB22@orca.penguincomputing.com>

RR

Sent from my GoodLink synchronized handheld (www.good.com)


 -----Original Message-----
From: 	SUVANKAR MOITRA [mailto:suvankar_moitra at yahoo.com]
Sent:	Friday, July 06, 2007 03:33 AM Pacific Standard Time
To:	linux clustering
Subject:	Re: SV: [linux-cluster] multipath issue... Smells of hardware issue.

hi ,

Pl install the device driver in failover mode.

regards

Suvankar
--- Kristoffer Lippert <kristoffer.lippert at jppol.dk>
wrote:

> Hi,
> 
> Thank you very  much for the explaination.
> 
> The hardware should under no circumstances take 5
> minutes to perform a readsector. Not even when the
> command queue is very long.
> I've tried copying files to and from the SAN, and
> i've tried a little program called sys_basher
> working the disks continously since last Friday.
> (almost a week) and i have not been able to
> reproduce the error. Before i could produce it
> within an hour by copying files.
> I've only seen the error on one server, and i've
> changed nothing. (well, obvouisly something must
> have changed since the error seems to be gone.) 
> 
> I get a throughput of about 120mb/sec on the san
> using GFS1. It's fast enough for my use (wich is
> large files for a website). Is it far below expected
> throughput? 
> 
> Kind regards
> Kristoffer
> 
> 
> 
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] P? vegne
> af Benjamin Marzinski
> Sendt: 5. juli 2007 22:01
> Til: linux clustering
> Emne: Re: [linux-cluster] multipath issue... Smells
> of hardware issue.
> 
> On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer
> Lippert wrote:
> >    Hi,
> > 
> >    I have a setup with two identical RX200s3 FuSi
> servers talking to a SAN
> >    (SX60 + extra controller), and that works fine
> with gfs1.
> > 
> >    I do however see some errors on one of the
> servers. It's in my message log
> >    and only now and then now and then (though
> always under load, but i cant
> >    load it and thereby force it to give the
> error).
> > 
> >    The error says:
> >    Jun 28 15:44:17 app02 multipathd: 8:16: mark as
> failed
> >    Jun 28 15:44:17 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 28 15:44:17 app02 kernel: end_request: I/O
> error, dev sdb, sector
> >    705160231
> >    Jun 28 15:44:17 app02 kernel: device-mapper:
> multipath: Failing path 8:16.
> >    Jun 28 15:44:22 app02 multipathd: sdb:
> readsector0 checker reports path is
> >    up
> >    Jun 28 15:44:22 app02 multipathd: 8:16:
> reinstated
> >    Jun 28 15:44:22 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> >    Jun 28 15:46:02 app02 multipathd: 8:32: mark as
> failed
> >    Jun 28 15:46:02 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 28 15:46:02 app02 kernel: end_request: I/O
> error, dev sdc, sector
> >    739870727
> >    Jun 28 15:46:02 app02 kernel: device-mapper:
> multipath: Failing path 8:32.
> >    Jun 28 15:46:06 app02 multipathd: sdc:
> readsector0 checker reports path is
> >    up
> >    Jun 28 15:46:06 app02 multipathd: 8:32:
> reinstated
> >    Jun 28 15:46:06 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> > 
> >    To me i looks like a fiber that bounces up and
> down. (There is no switch
> >    involved).
> > 
> >    Sometimes i only get a slightly shorter
> version:
> >    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI
> error: return code =
> >    0x00070000
> >    Jun 29 09:04:32 app02 kernel: end_request: I/O
> error, dev sdb, sector
> >    2782490295
> >    Jun 29 09:04:32 app02 kernel: device-mapper:
> multipath: Failing path 8:16.
> >    Jun 29 09:04:32 app02 multipathd: 8:16: mark as
> failed
> >    Jun 29 09:04:32 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 1
> >    Jun 29 09:04:37 app02 multipathd: sdb:
> readsector0 checker reports path is
> >    up
> >    Jun 29 09:04:37 app02 multipathd: 8:16:
> reinstated
> >    Jun 29 09:04:37 app02 multipathd:
> main_disk_volume1: remaining active
> >    paths: 2
> > 
> >    Any sugestions, but start swapping hardware?
> 
> It's possible that your scsi device is timing out
> the scsi read command from the readsector0 path
> checker, which is what it appears that your setup is
> using to check the path status.  This checker has
> it's timeout set to 5 minutes, but I suppose that it
> is possible to take this long if your hardware is a
> flaky. If you're willing to recompile the code, you
> can change this default by changing DEF_TIMEOUT in
> libcheckers/checkers.h. DEF_TIMEOUT is the scsi
> command timeout in milliseconds.
> 
> Otherwise, if you are only seeing this on one
> server, swapping hardware seems like a reasonable
> thing to try.
> 
> -Ben
>  
> >    Mvh / Kind regards
> > 
> >    Kristoffer Lippert
> >    Systemansvarlig
> >    JP/Politiken A/S
> >    Online Magasiner
> > 
> >    Tlf. +45 8738 3032
> >    Cell. +45 6062 8703
> 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> >
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
>
https://www.redhat.com/mailman/listinfo/linux-cluster
> 



      ____________________________________________________________________________________
Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/ 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070706/1eb064ce/attachment.htm>

From Robert.Gil at americanhm.com  Fri Jul  6 16:08:25 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 6 Jul 2007 12:08:25 -0400
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <468E535F.2080606@algitech.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I managed to figure out this problem in my environment. It had to do
with the service being set to exclusive. When a service is set to
exclusive, it is the only service allowed to run on that box. Which
means no other services, such as mysql, can run on that server. So, if
the ip is taken down, it will not start up again, because it does not
have exclusivity to the server. I have no need for exclusivity in any
way. If your running multiple services don't use it.


Robert Gil
Linux Systems Administrator
American Home Mortgage


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes
Sent: Friday, July 06, 2007 10:36 AM
To: linux clustering
Subject: Re: [Linux-cluster] IP Relocate Error / IP Restart error

Hello,
I'm bumping this question since I'm experienceing a smiliar problem.
When one of my services fails and the cluster is trying to restart it,
the node withdraws the ip and route.
It seems that it can't setup the ip again when it has withdrawn. It can
failover between nodes which holds other ipnumbers though, but never
back except when I manully puts back the ip and route.
I don't want to relocate the service just if sms-pixie fails but only to
restart it (its stops when it looses connection to a server).
I'm using bond and my configuration looks like this:

                <resources>
                        <script file="/etc/init.d/mysqld"
name="mysqld-db"/>
                        <script file="/etc/init.d/postgresql" 
name="psql-db"/>
                        <script file="/etc/init.d/sms-pixe" 
name="sms-pixie"/>
                        <ip address="<ip1>" interface="bond0" 
monitor_link="1"/>
                        <ip address="<ip2>" interface="bond0" 
monitor_link="1"/>
                </resources>
                <service autostart="1" domain="www-project1" 
name="www-project1" recovery="restart">
                        <ip ref="<ip1>"/>
                        <script ref="psql-db"/>
                </service>
                <service autostart="1" domain="www-project2" 
name="www-project2" recovery="restart">
                        <ip ref="<ip2>"/>
                        <script ref="mysqld-db"/>
                </service>

Any thoughts would be appriciated.

Regards, Dan


Robert Gil wrote:

> I have an IP address as a resource. I have the ip address in a 2 node 
> failover domain (total 4 nodes).
>  
> When i run ifconfig eth0:1 down
>  
> The service shows as stopped in clustat and the following errors show 
> in the logs
>  
> Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #71: Relocating 
> failed service mastervip Jun  1 12:25:36 <host> clurgmgrd[5346]: 
> <warning> #70: Attempting to restart service mastervip locally.
> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Recovering failed 
> service mastervip Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> 
> start on ip:192.168.2.100 returned 1 (generic error) Jun  1 12:25:37 
> <host> clurgmgrd[5346]: <warning> #68: Failed to start mastervip; 
> return value: 1 Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> 
> Stopping service mastervip Jun  1 12:25:37 <host> clurgmgrd[5346]: 
> <notice> Service mastervip is stopped
>  
> The following is the resources in /etc/cluster.conf
>  
>                 <resources>
>                         <clusterfs device="/dev/mapper/mqdata-mqdata" 
> force_unmount="0" fsid="22567" fstype="gfs" mountpoint="/mqdata" 
> name="mqdata" options=""/>
>                         <ip address="192.168.2.100" interface="eth0" 
> monitor_link="1"/>
>                 </resources>
>  
> The service in /etc/cluster.conf
>  
>                 <service autostart="1" domain="mysql" exclusive="1" 
> name="mastervip" recovery="relocate">
>                         <ip ref="192.168.2.100"/>
>                 </service>
> Any ideas?
>  
> Thanks,
>
>  
> *
> *Robert Gil*
> *Linux Systems Administrator*
> *American Home Mortgage*
> *Phone: 631-622-8410*
> Cell: 631-827-5775
> *Fax: 516-495-5861*
> *
>  
>
>-----------------------------------------------------------------------
>-
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Fri Jul  6 16:13:25 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:13:25 -0400
Subject: [Linux-cluster] clusvcadm hangs starting service
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA686@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA686@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <20070706161325.GB1681@redhat.com>

On Sun, Jun 17, 2007 at 01:34:24AM -0400, Robert Gil wrote:
> I have a service for an ip which does a mysql check as a dependancy for
> failover. For some reason clusvcadm hangs when trying to either enable,
> disable, or restart that service. I get no errors in the error log and
> it just hangs. This is luckily in our test environment, but this will
> not be good in production. We use this floating ip for a couple of mysql
> servers doing replication so we always want to have the ip pointing to
> the master (rw) server.
>  
> Has anyone seen the services hanging? How can this be resolved? Should
> rgmanager be restarted? wont this fence the server?

I haven't seen a hang during start; is there some interaction that might
be making it stick?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:14:10 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:14:10 -0400
Subject: [Linux-cluster] stuck with lock usrm::rg="db"
In-Reply-To: <648d054e0706180523y23bab83dq690d6a7fe5e6c23@mail.gmail.com>
References: <648d054e0706180523y23bab83dq690d6a7fe5e6c23@mail.gmail.com>
Message-ID: <20070706161410.GC1681@redhat.com>

On Mon, Jun 18, 2007 at 02:23:13PM +0200, Simon Jolle wrote:
> Hi list
> 
> when doing a clustat, the rgmanager doesn't respond and you cant see
> the cluster resource group (after long timeout).

That sounds like a 4.4 bug which was fixed in 4.5

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:17:01 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:17:01 -0400
Subject: [Linux-cluster] Cluster service restarting
In-Reply-To: <4679BF38.6080509@flinders.edu.au>
References: <4679BF38.6080509@flinders.edu.au>
Message-ID: <20070706161701.GD1681@redhat.com>

On Thu, Jun 21, 2007 at 09:28:48AM +0930, David Schroeder wrote:

> We have found the services restart in place regularly, up to 2 or 3 
> times a day sometimes. The cause is the Failure to ping one or another 
> of the clustered service IP addresses and is evident from the log 
> entries. This happens less frequently on the database server with one 
> clustered interface than it does with the webserver that has 5. The 
> failure to ping that is reported in the logs for the webserver is not 
> always on the same IP address and it seems quite random in time and 
> which in which IP address it reports is at fault. There are no load 
> related issues as this is still in the testing stage.

> I have turned the "Monitor Link" setting off and it still happens.

If monitor link is turned off and the IP address is failing because of
the link state check (link not detected in the logs), then it's a bug.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:18:15 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:18:15 -0400
Subject: [Linux-cluster] rgmanager-2.0.24 doesn't execute script status
In-Reply-To: <20070621153732.GD15269@helsinki.fi>
References: <20070621153732.GD15269@helsinki.fi>
Message-ID: <20070706161815.GE1681@redhat.com>

On Thu, Jun 21, 2007 at 06:37:33PM +0300, Janne Peltonen wrote:
> Hi.
> 
> It seems to me that there is a bug in rgmanager 2.0.24 (at least in the
> centos build). It doesn't execute status for service scripts, even if
> there is this line in /usr/share/cluster/script.sh:
> 
>         <action name="status" interval="30s" timeout="0"/>
> 

Fixed in the 5.1 beta which should be out soon.

> P.S. Any news on the fs.sh front?

I'm sure I knew whaty you were referring to here, but I forgot ;)



-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:23:35 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:23:35 -0400
Subject: [Linux-cluster] Cluster node without access to all resources -
	trouble
In-Reply-To: <20070628184053.GD3221@helsinki.fi>
References: <20070628154537.GA3108@helsinki.fi>
	<20070628183341.GU8818@redhat.com>
	<20070628184053.GD3221@helsinki.fi>
Message-ID: <20070706162335.GF1681@redhat.com>

On Thu, Jun 28, 2007 at 09:40:54PM +0300, Janne Peltonen wrote:
> On Thu, Jun 28, 2007 at 02:33:42PM -0400, Lon Hohberger wrote:
> > (1) don't start rgmanager on the fifth node :), or
> 
> ...now there's an idea :)
> 
> > (2) if you do start rgmanager on the fifth node, make all services be
> > part of a "restricted" failover domain comprised of the other four
> > nodes.
> > 
> > > Or
> > > should I just make sure that the fifth member always comes up last (that
> > > is, won't be running while the others are coming up)? Or should I aceept
> > > that I'm going to create more harm than avoiding by letting the fifth
> > > node belong to the cluster, and just run it outside the cluster?
> > 
> > If the above two don't work, it's a bug.
> 
> If (2) means /all/ the services, even the one that should be running on
> the fifth node, it's more or less equal to (1), isn't it? That is, the
> service that I want to be running on node five can't be a clustered
> service (which is, come to think of it, exactly what I want...)

I missed the fact that there was a service you had on the fifth node.

You can either start rgmanager on the fifth node and use a 1-node
restricted domain for the service, or... start the service via init and
save a couple of MB of memory. :)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:31:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:31:19 -0400
Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy
In-Reply-To: <426bed110706281549n70b37bfaq178c04530540c568@mail.gmail.com>
References: <C1E9042E-7DB2-41D3-84F0-7A1867F448D5@gmail.com>
	<20070628183032.GT8818@redhat.com>
	<426bed110706281549n70b37bfaq178c04530540c568@mail.gmail.com>
Message-ID: <20070706163119.GG1681@redhat.com>

On Fri, Jun 29, 2007 at 10:49:35AM +1200, Rohit Grover wrote:
> Ok, couple of notes:
> >
> >* MD is only unsafe in a cluster if it's used on multiple cluster nodes.
> >That is, it should be fairly easy to implement a resource agent which
> >assembles MD devices from network block devices - on one node at a time.
> 
> True. I would like to have MD assembling iSCSI initiators (the same set, of
> course) at multiple nodes. This will facilitate load distribution.
> Isn't it true that if MD is made to not cache any data flowing through it
> (and leave GFS to do caching and coherency control across the cluster), then
> MD should be a viable solution to putting together iSCSI initiators with
> RAID?
> 
> 
> * DRBD only will work with two writers (if 0.8.x+).  I'm not sure how many
> >mirror targets you can maintain.
> 
> 
> Could you please elaborate on this? I don't understand what is meant by
> 'DRBD will only work with two writers'. Thanks.

I think drbd only supports two-node mirrors.  I.e., you can set up a
mirror from only two nodes at a time, and that's it: no other nodes can
access the data (directly, at least).

I'm not sure how many sets of these mirror setups you can run on a given
pair of nodes, or if there is even a maximum.

> * Aren't most iSCSI targets RAID arrays already (?)
>
> Yes, they are in our case. But we also want to survive software/firmware
> failures of the iSCSI targets.
> 

Complexity has its own problems :)


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:31:52 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:31:52 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070629132556.GK29854@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
Message-ID: <20070706163152.GH1681@redhat.com>

On Fri, Jun 29, 2007 at 04:25:56PM +0300, Janne Peltonen wrote:
> Hi!
> 
> I had the trouble with fs.sh a while ago, and, well, it hasn't gone
> anywhere. Are there any news? Thanks.

I forgot what this was... could you just mail me your original email
off-list?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:33:20 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:33:20 -0400
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
Message-ID: <20070706163319.GI1681@redhat.com>

On Fri, Jun 29, 2007 at 06:58:54AM -0700, steven holmes wrote:
> has any one tried to build a storage cluster and then build vmware on those hosts and make vm windows cluster accros the 2 hosts.

I've never done that, but I've done a cluster of Xen (Linux) guests
within a physical cluster.

There's no reason running a fully-virtual guest of Windows shouldn't
work, but I don't know how Windows cluster software works.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:34:49 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:34:49 -0400
Subject: [Linux-cluster] IP monitor failing periodically
In-Reply-To: <4686A3BF.9070609@cmiware.com>
References: <4686A3BF.9070609@cmiware.com>
Message-ID: <20070706163449.GJ1681@redhat.com>

On Sat, Jun 30, 2007 at 01:41:03PM -0500, Chris Harms wrote:
> I am experiencing periodic failovers due to a floating IP address not 
> passing the status check:
> 
> clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
> Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip 
> "192.168.13.204" returned 1 (generic error)
> 
> Both nodes have bonded NICs with gigabit connections to redundant 
> switches, so it is unlikely they are going down, nothing in the logs 
> about linux losing the links.  I parked all the cluster services - 2 
> Postgres services and 1 Apache - on one node and allowed it to run 
> overnight.  There would be no client activity during this time. One 
> Postgres service failed two times in this manner and the other failed 
> once in this manner.  The Apache service did not fail.
> 
> What can I do to resolve this or get more information out of the system 
> to resolve this?

Hmm, with bonded NICs, ip.sh monitors the links of the physical devices.
It's supposed to check and not complain if either link is up.

The ping bit is a bit weird; you could just disable it in
/usr/share/cluster/ip.sh.

I.e. change the 'ping' line to '/bin/true'

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:41:48 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:41:48 -0400
Subject: [Linux-cluster] Rgmanager fails to restart
In-Reply-To: <20070701111748.GA9103@helsinki.fi>
References: <20070701111748.GA9103@helsinki.fi>
Message-ID: <20070706164148.GK1681@redhat.com>

On Sun, Jul 01, 2007 at 02:17:48PM +0300, Janne Peltonen wrote:
> Hi!
> 
> Sometimes, when I have cleanly shut down rgmanager on one node, and the
> services have nicely migrated to other nodes, trying to start rgmanager
> fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> device". clurgmgrd concludes that locks are not working and exits.
> (See strace output attached.)

That's really strange - it's almost like the DLM isn't responding to the
requests.  The open of /dev/misc/dlm_rgmanager is performed by libdlm;
rgmanager is simply opening it.  If I am not mistaken, the previous open
of /dev/misc/dlm-control followed by the write is basically saying
"yeah, that device exists".  However, the device node isn't there, so we
go to open it and it fails.

> Trying to stop cman fails:
> 
> --clip--
> [jmmpelto at pcn1 ~]$ sudo service cman restart
> Stopping cluster: 
>    Stopping fencing... done
>    Stopping cman... failed
> /usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
>                                                            [FAILED]

If something happened to the dlm while rgmanager was trying to use it, I
suspect there's a chance that it could keep something held (preventing
it from shutting down).

This sounds related to an open bugzilla where rgmanager is not cleaning
up a lockspace in all cases.

In a clean shutdown, rgmanager should always be cleaning up the
lockspace.

-- Lon



From lhh at redhat.com  Fri Jul  6 16:55:10 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:55:10 -0400
Subject: [Linux-cluster] At which point is a service relocated?
In-Reply-To: <20070701121239.GE9103@helsinki.fi>
References: <20070701121239.GE9103@helsinki.fi>
Message-ID: <20070706165510.GL1681@redhat.com>

On Sun, Jul 01, 2007 at 03:12:39PM +0300, Janne Peltonen wrote:
> Hi!
> 
> It seems to me that if a node comes up that is higher in the prioritized
> failover domain of a service than the node that was running it, the
> service doesn't always relocate. Is there some documentation on this
> somewhere?
> 
> It appears that the service relocates if the difference between the
> priorities of the old and new node is at least two.

> Is there a way to modify this?

This sounds like a bug.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:56:21 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:56:21 -0400
Subject: [Linux-cluster] issues starting services
In-Reply-To: <46891943.6010202@cmiware.com>
References: <46891943.6010202@cmiware.com>
Message-ID: <20070706165621.GM1681@redhat.com>

On Mon, Jul 02, 2007 at 10:26:59AM -0500, Chris Harms wrote:
> Had a nice little hardware failure over the weekend.  After having the 
> machine come back on-line, 2 of the registered services didn't start 
> (the fact that they weren't running already is a function of services 
> not failing over until fencing succeeds).
> 
> Issuing a start operation in Conga did nothing. 
> 
> Issuing  clusvcadm -e [service] -m [node] yielded:
>    Member [node] trying to enable service:[service]...Success
>    service:[service] is now running on [node]
> This was not the case.  Nothing happened.  Nothing was logged.

Do you know what state they were in?  (stopped, started, recover,
failed, etc)?


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 16:58:10 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 12:58:10 -0400
Subject: [Linux-cluster] failover not working
In-Reply-To: <46895BC9.5010101@transolutions.net>
References: <46895BC9.5010101@transolutions.net>
Message-ID: <20070706165810.GN1681@redhat.com>

On Mon, Jul 02, 2007 at 03:10:49PM -0500, James Wilson wrote:
> Hey All,
> 
>    I was just wondering if someone could point out my errors? I 
> currently have 3 servers in a cluster server1(dolphins), server2(lions), 
> server3(patriots). server1 and server3 are being mirrored via DRBD. I 
> have set up the cluster so that if server1 fails then server3 will take 
> over. I have configured a vip to go between the 2 servers. I also do a 
> gnbd_import on this vip from server2. The problem is when ever I pull 
> the plug on server1 the vip never moves over to server3. here is a copy 
> of my cluster.conf. Any help is appreciated.

Does fencing finish? (fence of <x> succeeded) in the logs?
>                <clusternode name="server3" nodeid="3" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="nas-fencing" 
> nodename="server2"/>

Should be server3.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 17:01:00 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 13:01:00 -0400
Subject: [Linux-cluster] failover not working
In-Reply-To: <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com>
References: <46895BC9.5010101@transolutions.net>
	<531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com>
Message-ID: <20070706170054.GA5981@redhat.com>

On Tue, Jul 03, 2007 at 03:59:59PM -0500, Jonathan Brassow wrote:
> 
> That section looks funny.  You don't need the two 'failoverdomain's;  
> and you don't need the two services.  Does rgmanager even startup?   
> Check /var/log/messages for more info.
> 

I thought it looked fine. :)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 17:03:02 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 13:03:02 -0400
Subject: [Linux-cluster] dual fence redux
In-Reply-To: <468AD9B7.4060100@cmiware.com>
References: <468AD9B7.4060100@cmiware.com>
Message-ID: <20070706170302.GB5981@redhat.com>

On Tue, Jul 03, 2007 at 06:20:23PM -0500, Chris Harms wrote:
> To recap:
> I am attempting to setup a 2 node cluster where each will run a DB and 
> an apache service to be failed over between them.  Both are fenced via 
> Dell DRAC connected via the system NICs (this adds to the issue, but 
> manual fencing is broken).

> Thanks to a hardware issue on NodeB, I am unable to get to the logs off 
> of it presently.

It sounds like fencing was tried, but hadn't completed yet... and when
the node came back up, it tried to fence it ... again?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Fri Jul  6 17:09:52 2007
From: chris at cmiware.com (Chris Harms)
Date: Fri, 06 Jul 2007 12:09:52 -0500
Subject: [Linux-cluster] dual fence redux
In-Reply-To: <20070706170302.GB5981@redhat.com>
References: <468AD9B7.4060100@cmiware.com> <20070706170302.GB5981@redhat.com>
Message-ID: <468E7760.6050207@cmiware.com>

Hard to say.  I know that only once of about 3 times has the proper node 
been fenced and not have the remaining node halt.  The survivor did not 
log anything about being fenced, just that it was being halted.

Chris

Lon Hohberger wrote:
> On Tue, Jul 03, 2007 at 06:20:23PM -0500, Chris Harms wrote:
>   
>> To recap:
>> I am attempting to setup a 2 node cluster where each will run a DB and 
>> an apache service to be failed over between them.  Both are fenced via 
>> Dell DRAC connected via the system NICs (this adds to the issue, but 
>> manual fencing is broken).
>>     
>
>   
>> Thanks to a hardware issue on NodeB, I am unable to get to the logs off 
>> of it presently.
>>     
>
> It sounds like fencing was tried, but hadn't completed yet... and when
> the node came back up, it tried to fence it ... again?
>
>   



From lhh at redhat.com  Fri Jul  6 18:29:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 14:29:30 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070706163152.GH1681@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
Message-ID: <20070706182930.GE5981@redhat.com>

On Fri, Jul 06, 2007 at 12:31:52PM -0400, Lon Hohberger wrote:
> On Fri, Jun 29, 2007 at 04:25:56PM +0300, Janne Peltonen wrote:
> > Hi!
> > 
> > I had the trouble with fs.sh a while ago, and, well, it hasn't gone
> > anywhere. Are there any news? Thanks.
> 
> I forgot what this was... could you just mail me your original email
> off-list?

Nevermind, I found it

https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul  6 18:31:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 6 Jul 2007 14:31:51 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070706182930.GE5981@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
Message-ID: <20070706183151.GF5981@redhat.com>

On Fri, Jul 06, 2007 at 02:29:30PM -0400, Lon Hohberger wrote:
> On Fri, Jul 06, 2007 at 12:31:52PM -0400, Lon Hohberger wrote:
> > On Fri, Jun 29, 2007 at 04:25:56PM +0300, Janne Peltonen wrote:
> > > Hi!
> > > 
> > > I had the trouble with fs.sh a while ago, and, well, it hasn't gone
> > > anywhere. Are there any news? Thanks.
> > 
> > I forgot what this was... could you just mail me your original email
> > off-list?
> 
> Nevermind, I found it
> 
> https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html

I think it's actually the same problem as the 'status checks' being
wrong in 2.0.24; just a different symptom.

What architecture do you have?  I can build a package for you to test if
you want.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From janne.peltonen at helsinki.fi  Fri Jul  6 18:36:59 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 6 Jul 2007 21:36:59 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070706183151.GF5981@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
Message-ID: <20070706183658.GA24692@helsinki.fi>

On Fri, Jul 06, 2007 at 02:31:51PM -0400, Lon Hohberger wrote:
> > > I forgot what this was... could you just mail me your original email
> > > off-list?
> > Nevermind, I found it
> > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> I think it's actually the same problem as the 'status checks' being
> wrong in 2.0.24; just a different symptom.
> What architecture do you have?  I can build a package for you to test if
> you want.

x86_64

Thanks, it'd be nice.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From simanhew at gmail.com  Fri Jul  6 22:47:22 2007
From: simanhew at gmail.com (siman hew)
Date: Fri, 6 Jul 2007 18:47:22 -0400
Subject: [Linux-cluster] Update cluster.conf when there is one node left
	cluster.
Message-ID: <6596a7c70707061547j192a8f54yec77605d4de4a5ed@mail.gmail.com>

Hello,

I have one problem that I do not know it is a feature or a bug of cluster.
I have a cluster with 3 nodes, two are running, and one is stopped, on
RHEL5.
I would like to add a failover domain, and changed cluster.conf, then
ccs_tool update it.
I got error message:
Failed to receive COMM_UPDATE_COMMIT_ACK from node3. Hint: Check the log on
node3 for reason. Failed to update config file

or similar error message.
I tried on Conga, it does not work too.
I just wonder it is a bug or a feature -- we can not update cluster.conf if
there is one node is dead?

Thanks in advanced.

Siman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070706/8f38c396/attachment.htm>

From simanhew at gmail.com  Fri Jul  6 23:01:57 2007
From: simanhew at gmail.com (siman hew)
Date: Fri, 6 Jul 2007 19:01:57 -0400
Subject: [Linux-cluster] Re: Update cluster.conf when there is one node left
	cluster.
In-Reply-To: <6596a7c70707061547j192a8f54yec77605d4de4a5ed@mail.gmail.com>
References: <6596a7c70707061547j192a8f54yec77605d4de4a5ed@mail.gmail.com>
Message-ID: <6596a7c70707061601r424ba12es42555bdaeca1de78@mail.gmail.com>

The worse thing is sometimes, I got error message to say failed to update
config file, on node 1.
but check log on node2, it shows :Update of cluster.conf complete (version n
-> n+1 ).

It really confused me.


On 7/6/07, siman hew <simanhew at gmail.com> wrote:
>
> Hello,
>
> I have one problem that I do not know it is a feature or a bug of cluster.
> I have a cluster with 3 nodes, two are running, and one is stopped, on
> RHEL5.
> I would like to add a failover domain, and changed cluster.conf, then
> ccs_tool update it.
> I got error message:
> Failed to receive COMM_UPDATE_COMMIT_ACK from node3. Hint: Check the log
> on node3 for reason. Failed to update config file
>
> or similar error message.
> I tried on Conga, it does not work too.
> I just wonder it is a bug or a feature -- we can not update cluster.confif there is one node is dead?
>
> Thanks in advanced.
>
> Siman
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070706/6ad46c90/attachment.htm>

From zxvdr.au at gmail.com  Sat Jul  7 00:23:23 2007
From: zxvdr.au at gmail.com (David Robinson)
Date: Sat, 07 Jul 2007 10:23:23 +1000
Subject: [Linux-cluster] IP monitor failing periodically
In-Reply-To: <46883AAB.8000801@flinders.edu.au>
References: <4686A3BF.9070609@cmiware.com> <46883AAB.8000801@flinders.edu.au>
Message-ID: <468EDCFB.3070807@gmail.com>

David Schroeder wrote:
> Hi Chris,
> 
> I am experiencing the same problem on RHEL 5 and I have a support 
> request in with RedHat.

Hi David,

As you have already found, the problem is caused by the bug below:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=246623

in short: ping segmentation faults, which then returns a generic error. 
This causes cluster suite to wrongly think that the IP is unreachable so 
relocates/restarts the service. There's nothing wrong with the IP 
address, the network, or cluster suite.

You could possible modify /usr/share/cluster/ip.sh so that you 
workaround the problem (ie, check for return code 139 and if you get it 
try the ping again), or recompile the iputils package without the -pie 
flag - both are ugly hacks thou.

Dave



From bernard.chew at muvee.com  Sat Jul  7 07:57:28 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Sat, 7 Jul 2007 15:57:28 +0800
Subject: [Linux-cluster] CLVM
Message-ID: <229C73600EB0E54DA818AB599482BCE90181667F@shadowfax.sg.muvee.net>

Hi,

I have problems starting clvmd in a second node (after starting it
successfully on the first node) of a newly created 4-nodes cluster; no
problems staring the service first time on any nodes. Running "cman_tool
services" will show that the second node which started clvmd is in a
"join" status while the first node show a "update" status. This remains
even after a long period of time.

Given that the directories (i.e. /var /usr / ) are created using the
default lvm manager during installation, and I install the
lvm2-cluster-2.02.06-7.0.RHEL4.x86_64.rpm subsequently as part of the
requirements to set up GFS, will this cause the clvmd not to start
properly? I have no problems starting ccsd, cman and fenced.

Thanks in advance,
Bernard Chew
IT Operations Engineer




From sdake at redhat.com  Sun Jul  8 01:06:07 2007
From: sdake at redhat.com (Steven Dake)
Date: Sat, 07 Jul 2007 18:06:07 -0700
Subject: [Linux-cluster] Re: [Openais] Basic cluster not starting
In-Reply-To: <BLU105-W6B7BCBEAE55579D26261A80020@phx.gbl>
References: <BLU105-W6B7BCBEAE55579D26261A80020@phx.gbl>
Message-ID: <1183856768.3559.8.camel@balance>

James,

Let me speak with Patrick Caulfield on this topic Monday.

I have not seen this before in any of our testing, but it is possible
someone else using RHCS has.  I've also copied the linux-cluster list.

The problem appears to be, however, with something relating to ccs or
the startup order.  The opennais code doesn't know anything about the
ccsd node ids or parsing of the xml configuration file.  That work is
done by ccsd and cman.

Did you try the cman init script?

Regards
-steve

On Thu, 2007-07-05 at 14:21 -0400, james anderson wrote:
> I am attempting to install GFS on FC6 64bit using RPMs.
> Below you will find my config and steps taken to get a GFS cluster
> running.
> I am unclear if the problem is with OpenAIS or RHCS.
>  
>  
> FC6 64bit RPMs
> --------------
> rpm -ivh openais-0.80.1-3.x86_64.rpm
> rpm -ivh perl-Net-Telnet-3.03-5.noarch.rpm
> rpm -ivh cman-2.0.18-2.fc6.x86_64.rpm
> System config cluster
> rpm -ivh system-config-cluster-1.0.29-1.0.noarch.rpm
> Luci
> rpm -ivh python-imaging-1.1.6-3.fc6.x86_64.rpm
> rpm -ivh zope-2.9.7-2.fc6.x86_64.rpm
> rpm -ivh plone-2.5.3-1.fc6.x86_64.rpm
> rpm -ivh luci-0.9.3-2.fc6.x86_64.rpm
> Ricci
> rpm -ivh --nodeps oddjob-libs-0.27-8.x86_64.rpm
> rpm -ivh oddjob-0.27-8.x86_64.rpm
> rpm -ivh modcluster-0.9.3-2.fc6.x86_64.rpm
> rpm -ivh ricci-0.9.3-2.fc6.x86_64.rpm
>  
>  
> /etc/cluster/cluster.conf
> -------------------------
> <?xml version="1.0"?>
> <cluster alias="alpha_cluster" config_version="8"
> name="alpha_cluster">
>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>   <clusternodes>
>     <clusternode name="node1" nodeid="1" votes="1">
>       <multicast addr="239.192.196.121" interface="eth1"/>
>       <fence>
>         <method name="1">
>           <device name="nps1" port="1" switch="1"/>
>         </method>
>       </fence>
>   </clusternode>
>   <clusternode name="node2" nodeid="2" votes="1">
>     <multicast addr="239.192.196.121" interface="eth0"/>
>     <fence>
>       <method name="1">
>         <device name="nps1" port="2" switch="1"/>
>       </method>
>     </fence>
>   </clusternode>
>   <clusternode name="node3" nodeid="3" votes="1">
>   <multicast addr="239.192.196.121" interface="eth2"/>
>     <fence>
>       <method name="1">
>         <device name="nps1" port="3" switch="1"/>
>       </method>
>     </fence>
>   </clusternode>
> </clusternodes>
> <cman>
>   <multicast addr="239.192.196.121"/>
> </cman>
> <fencedevices>
>   <fencedevice agent="fence_apc" ipaddr="10.1.1.123" login="root"
> name="***" passwd="***"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
>  
>  
> Commands
> --------
> # modprobe lock_dlm
> # modprobe dlm
> # mount -t configfs non /sys/kernel/config
> # ccsd
> # cman_tool join
>  
>  
> /var/log/messages
> -----------------
> 1 Jul 2 14:50:16 node1 ccsd[22457]: Starting ccsd 2.0.18:
> 2 Jul 2 14:50:16 node1 ccsd[22457]: Built: Oct 1 2006 17:18:46
> 3 Jul 2 14:50:16 node1 ccsd[22457]: Copyright (C) Red Hat, Inc. 2004
> All rights reserved.
> 4 Jul 2 14:50:45 node1 ccsd[22457]: Unable to connect to cluster
> infrastructure after 30 seconds.
> 5 Jul 2 14:51:15 node1 ccsd[22457]: Unable to connect to cluster
> infrastructure after 60 seconds.
> 6 Jul 2 14:51:39 node1 ccsd[22457]: cluster.conf (cluster name =
> alpha_cluster, version = 6) found.
> 7 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive Service
> RELEASE 'subrev 1204 version 0.80.1'
> 8 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C) 2002-2006
> MontaVista Software, Inc and contributors.
> 9 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C) 2006 Red
> Hat, Inc.
> 10 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] No nodeid specified in
> cluster.conf
> 11 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Error reading CCS
> info, cannot start
> 12 Jul 2 14:51:41 node1 openais[22542]: [MAIN ]
> 13 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive exiting
> (-9).
> 14 Jul 2 14:51:45 node1 ccsd[22457]: Unable to connect to cluster
> infrastructure after 90 seconds.
> 15 Jul 2 14:52:15 node1 ccsd[22457]: Unable to connect to cluster
> infrastructure after 120 seconds.
> 16 Jul 2 14:52:44 node1 ccsd[22457]: Stopping ccsd, SIGTERM received.
>  
> Lines 1-6 are from running the "ccsd" command above.
> Lines 7-13 are from running the "cman_tool join" command above.
>  
> I also received the following error message:
> cman not started: CCS does not have a nodeid for this node, run
> 'ccs_tool addnodeids' to fix
> cman_tool: aisexec daemon didn't start
>  
> Yes I did try running the ccs_tool addnodeids. It did not help. Notice
> in the cluster.conf the nodeids were already in place. Any pointers to
> narrowing down my problem are appreciated.
>  
> Thanks,
> James
>  
> 
> 
> ______________________________________________________________________
> See what you?re getting into?before you go there. Check it out!
> _______________________________________________
> Openais mailing list
> Openais at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais



From pastany at gmail.com  Sun Jul  8 03:53:22 2007
From: pastany at gmail.com (wind past)
Date: Sun, 8 Jul 2007 11:53:22 +0800
Subject: [Linux-cluster] SCSI Error
In-Reply-To: <1183384700.11507.75.camel@technetium.msp.redhat.com>
References: <200706272056059379538@gmail.com>
	<1183384700.11507.75.camel@technetium.msp.redhat.com>
Message-ID: <3feffa0b0707072053s66b86bf3vcdd16520d412b5cd@mail.gmail.com>

thanks ,i will try it later

2007/7/2, Bob Peterson <rpeterso at redhat.com>:
>
> On Mon, 2007-07-02 at 15:23 +0800, pastany wrote:
> > I am running a 4 node cluster with a fc switch and a fujitsu fc san
> > but i recevie this message ,and some partions dont work
> >
> > Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code =
> > 0x10000
> > Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector
> > 668992848
>
> Hi Pastany,
>
> This sounds like a hardware problem to me, not a GFS problem.
> It could be a bad drive, bad san or bad Host Bus Adapter (HBA).
> Perhaps you should unmount the san from all the nodes, then
> from one node, do a simple read test of the san:
>
> 1. Check dmesg and maybe clear your dmesg buffer:
>    dmesg -c
> 2. Try reading every sector of the san:
>    dd if=/dev/sdb of=/dev/null bs=1M
> 3. Check your console / dmesg to see if SCSI errors are reported.
>
> You may want to try that separately on a few different nodes just
> in case the error was caused by a bad HBA in the node that
> reported the problem.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070708/161a7bcb/attachment.htm>

From pcaulfie at redhat.com  Mon Jul  9 10:27:56 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 09 Jul 2007 11:27:56 +0100
Subject: [Linux-cluster] Re: [Openais] Basic cluster not starting
In-Reply-To: <1183856768.3559.8.camel@balance>
References: <BLU105-W6B7BCBEAE55579D26261A80020@phx.gbl>
	<1183856768.3559.8.camel@balance>
Message-ID: <46920DAC.1050907@redhat.com>

Steven Dake wrote:
> James,
> 
> Let me speak with Patrick Caulfield on this topic Monday.
> 
> I have not seen this before in any of our testing, but it is possible
> someone else using RHCS has.  I've also copied the linux-cluster list.
> 
> The problem appears to be, however, with something relating to ccs or
> the startup order.  The opennais code doesn't know anything about the
> ccsd node ids or parsing of the xml configuration file.  That work is
> done by ccsd and cman.
> 
> Did you try the cman init script?

It looks quite odd. cman is complaining that there is no nodeid in the config
file and there clearly is, in the one you posted. Are you sure that this is the
version that CCS is farming out? It's worth stopping ccsd on all nodes, checking
that the cluster.conf file that has the nodeid entries in also has the highest
version number, and then restarting ccsd.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From pcaulfie at redhat.com  Mon Jul  9 10:44:40 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 09 Jul 2007 11:44:40 +0100
Subject: [Linux-cluster] Rgmanager fails to restart
In-Reply-To: <20070701111748.GA9103@helsinki.fi>
References: <20070701111748.GA9103@helsinki.fi>
Message-ID: <46921198.5060600@redhat.com>

It's not clear to me which kernel you're using but there is a patch that went in
a short while ago that addressed a similar problem I think.

There seems to be a mismatch in the DLM lockspace list and the device files that
make them up which that patch addresses.

So it's worth checking that your kernel has this patch included:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=241817

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From Robert.Gil at americanhm.com  Mon Jul  9 12:29:20 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Mon, 9 Jul 2007 08:29:20 -0400
Subject: [Linux-cluster] CLVM
In-Reply-To: <229C73600EB0E54DA818AB599482BCE90181667F@shadowfax.sg.muvee.net>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

What error do you get? 


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Saturday, July 07, 2007 3:57 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] CLVM

Hi,

I have problems starting clvmd in a second node (after starting it
successfully on the first node) of a newly created 4-nodes cluster; no
problems staring the service first time on any nodes. Running "cman_tool
services" will show that the second node which started clvmd is in a
"join" status while the first node show a "update" status. This remains
even after a long period of time.

Given that the directories (i.e. /var /usr / ) are created using the
default lvm manager during installation, and I install the
lvm2-cluster-2.02.06-7.0.RHEL4.x86_64.rpm subsequently as part of the
requirements to set up GFS, will this cause the clvmd not to start
properly? I have no problems starting ccsd, cman and fenced.

Thanks in advance,
Bernard Chew
IT Operations Engineer


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From pk at nodex.ru  Mon Jul  9 12:37:42 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Mon, 9 Jul 2007 16:37:42 +0400
Subject: [Linux-cluster] Cluster.conf full manual
References: <F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <160501c7c225$f191a120$0a01a8c0@office.nodex.ru>

Hello, All!

Where i can read full manuals about cluster.conf ?

--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru




From dan.deshayes at algitech.com  Mon Jul  9 14:06:40 2007
From: dan.deshayes at algitech.com (dan.deshayes at algitech.com)
Date: Mon, 9 Jul 2007 16:06:40 +0200 (MEST)
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICAN
	HM.COM>
References: <468E535F.2080606@algitech.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <54098.10.11.12.13.1183990000.squirrel@10.11.12.13>

Hi,
thx for the reply but I'm not sure thats my problem.
I couldn't find the syntax for disabling the exclusivity (I'm not using gui)
but as far as I've understood its disabled by default. I tried with
exclusive="0" (not sure if its the right syntax though) but didn't solve
my problem.
But if the cluster was running with exclusive-mode the relocation
shouldn't work either, right?
As stated earlier the service restarts fine aslong as the node already
have an external ip.
Anyone with other ideas. maybe related to the "IP monitor failing
periodically"? but I don't have any problems running the cluster aslong as
the bond0 interface goes down, so maybe not.

Regards, Dan.

> I managed to figure out this problem in my environment. It had to do
> with the service being set to exclusive. When a service is set to
> exclusive, it is the only service allowed to run on that box. Which
> means no other services, such as mysql, can run on that server. So, if
> the ip is taken down, it will not start up again, because it does not
> have exclusivity to the server. I have no need for exclusivity in any
> way. If your running multiple services don't use it.
>
>
> Robert Gil
> Linux Systems Administrator
> American Home Mortgage
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes
> Sent: Friday, July 06, 2007 10:36 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] IP Relocate Error / IP Restart error
>
> Hello,
> I'm bumping this question since I'm experienceing a smiliar problem.
> When one of my services fails and the cluster is trying to restart it,
> the node withdraws the ip and route.
> It seems that it can't setup the ip again when it has withdrawn. It can
> failover between nodes which holds other ipnumbers though, but never
> back except when I manully puts back the ip and route.
> I don't want to relocate the service just if sms-pixie fails but only to
> restart it (its stops when it looses connection to a server).
> I'm using bond and my configuration looks like this:
>
>                 <resources>
>                         <script file="/etc/init.d/mysqld"
> name="mysqld-db"/>
>                         <script file="/etc/init.d/postgresql"
> name="psql-db"/>
>                         <script file="/etc/init.d/sms-pixe"
> name="sms-pixie"/>
>                         <ip address="<ip1>" interface="bond0"
> monitor_link="1"/>
>                         <ip address="<ip2>" interface="bond0"
> monitor_link="1"/>
>                 </resources>
>                 <service autostart="1" domain="www-project1"
> name="www-project1" recovery="restart">
>                         <ip ref="<ip1>"/>
>                         <script ref="psql-db"/>
>                 </service>
>                 <service autostart="1" domain="www-project2"
> name="www-project2" recovery="restart">
>                         <ip ref="<ip2>"/>
>                         <script ref="mysqld-db"/>
>                 </service>
>
> Any thoughts would be appriciated.
>
> Regards, Dan
>
>
> Robert Gil wrote:
>
>> I have an IP address as a resource. I have the ip address in a 2 node
>> failover domain (total 4 nodes).
>>
>> When i run ifconfig eth0:1 down
>>
>> The service shows as stopped in clustat and the following errors show
>> in the logs
>>
>> Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #71: Relocating
>> failed service mastervip Jun  1 12:25:36 <host> clurgmgrd[5346]:
>> <warning> #70: Attempting to restart service mastervip locally.
>> Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Recovering failed
>> service mastervip Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice>
>> start on ip:192.168.2.100 returned 1 (generic error) Jun  1 12:25:37
>> <host> clurgmgrd[5346]: <warning> #68: Failed to start mastervip;
>> return value: 1 Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice>
>> Stopping service mastervip Jun  1 12:25:37 <host> clurgmgrd[5346]:
>> <notice> Service mastervip is stopped
>>
>> The following is the resources in /etc/cluster.conf
>>
>>                 <resources>
>>                         <clusterfs device="/dev/mapper/mqdata-mqdata"
>> force_unmount="0" fsid="22567" fstype="gfs" mountpoint="/mqdata"
>> name="mqdata" options=""/>
>>                         <ip address="192.168.2.100" interface="eth0"
>> monitor_link="1"/>
>>                 </resources>
>>
>> The service in /etc/cluster.conf
>>
>>                 <service autostart="1" domain="mysql" exclusive="1"
>> name="mastervip" recovery="relocate">
>>                         <ip ref="192.168.2.100"/>
>>                 </service>
>> Any ideas?
>>
>> Thanks,
>>
>>
>> *
>> *Robert Gil*
>> *Linux Systems Administrator*
>> *American Home Mortgage*
>> *Phone: 631-622-8410*
>> Cell: 631-827-5775
>> *Fax: 516-495-5861*
>> *
>>
>>
>>-----------------------------------------------------------------------
>>-
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>




From bernard.chew at muvee.com  Mon Jul  9 14:46:18 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Mon, 9 Jul 2007 22:46:18 +0800
Subject: [Linux-cluster] CLVM
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <229C73600EB0E54DA818AB599482BCE90181667F@shadowfax.sg.muvee.net>
	<F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <229C73600EB0E54DA818AB599482BCE901816865@shadowfax.sg.muvee.net>

Hi,

Thanks for the quick reply. Here is the info that I gathered;

--------------------------------------------
Node 1 (which started CLVMD successfully):

[root at server3]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,3,0
clvmd move use event 3
clvmd recover event 3 (first)
clvmd add nodes
clvmd total nodes 1
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 3 done
clvmd move flags 0,0,1 ids 0,3,3
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 3 finished
clvmd move flags 1,0,0 ids 3,3,3
clvmd move flags 0,1,0 ids 3,4,3
clvmd move use event 4
clvmd recover event 4
clvmd add node 3

[root at server3 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 update
U-4,1,3
[1 3]

--------------------------------------------
Node 2 (which just wait forever):

[root at server2 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes

[root at server2 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 join
S-6,20,2
[1 3]
--------------------------------------------

Thanks for any help,
Bernard

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Monday, July 09, 2007 8:29 PM
To: linux clustering
Subject: RE: [Linux-cluster] CLVM

What error do you get? 


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Saturday, July 07, 2007 3:57 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] CLVM

Hi,

I have problems starting clvmd in a second node (after starting it
successfully on the first node) of a newly created 4-nodes cluster; no
problems staring the service first time on any nodes. Running "cman_tool
services" will show that the second node which started clvmd is in a
"join" status while the first node show a "update" status. This remains
even after a long period of time.

Given that the directories (i.e. /var /usr / ) are created using the
default lvm manager during installation, and I install the
lvm2-cluster-2.02.06-7.0.RHEL4.x86_64.rpm subsequently as part of the
requirements to set up GFS, will this cause the clvmd not to start
properly? I have no problems starting ccsd, cman and fenced.

Thanks in advance,
Bernard Chew
IT Operations Engineer


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From bsd_daemon at msn.com  Mon Jul  9 15:12:40 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Mon, 09 Jul 2007 15:12:40 +0000
Subject: [Linux-cluster] Re: [Openais] Basic cluster not starting
In-Reply-To: <1183856768.3559.8.camel@balance>
Message-ID: <BAY105-F21282FF0A1124278C4B866E3060@phx.gbl>


hi, can you send me openais.conf file ? or you can test openais, dont 
execute anything, only

sh # aisexec &  (as root)

if it's okay,

sh # ccsd
sh # cman_tool join

in addition to these, while the dlm module is loaded follow this steps

sh # modprobe configfs
sh # mount -t configfs none /sys/kernel/config
sh # modprobe dlm  (you must not load the dlm module before configfs, 
beacuse the dlm module is put under config)

have a nice day..


>From: Steven Dake <sdake at redhat.com>
>Reply-To: sdake at redhat.com,linux clustering <linux-cluster at redhat.com>
>To: james anderson <jamesanderson1 at hotmail.com>
>CC: openais at lists.linux-foundation.org, linux-cluster at redhat.com
>Subject: [Linux-cluster] Re: [Openais] Basic cluster not starting
>Date: Sat, 07 Jul 2007 18:06:07 -0700
>
>James,
>
>Let me speak with Patrick Caulfield on this topic Monday.
>
>I have not seen this before in any of our testing, but it is possible
>someone else using RHCS has.  I've also copied the linux-cluster list.
>
>The problem appears to be, however, with something relating to ccs or
>the startup order.  The opennais code doesn't know anything about the
>ccsd node ids or parsing of the xml configuration file.  That work is
>done by ccsd and cman.
>
>Did you try the cman init script?
>
>Regards
>-steve
>
>On Thu, 2007-07-05 at 14:21 -0400, james anderson wrote:
> > I am attempting to install GFS on FC6 64bit using RPMs.
> > Below you will find my config and steps taken to get a GFS cluster
> > running.
> > I am unclear if the problem is with OpenAIS or RHCS.
> >
> >
> > FC6 64bit RPMs
> > --------------
> > rpm -ivh openais-0.80.1-3.x86_64.rpm
> > rpm -ivh perl-Net-Telnet-3.03-5.noarch.rpm
> > rpm -ivh cman-2.0.18-2.fc6.x86_64.rpm
> > System config cluster
> > rpm -ivh system-config-cluster-1.0.29-1.0.noarch.rpm
> > Luci
> > rpm -ivh python-imaging-1.1.6-3.fc6.x86_64.rpm
> > rpm -ivh zope-2.9.7-2.fc6.x86_64.rpm
> > rpm -ivh plone-2.5.3-1.fc6.x86_64.rpm
> > rpm -ivh luci-0.9.3-2.fc6.x86_64.rpm
> > Ricci
> > rpm -ivh --nodeps oddjob-libs-0.27-8.x86_64.rpm
> > rpm -ivh oddjob-0.27-8.x86_64.rpm
> > rpm -ivh modcluster-0.9.3-2.fc6.x86_64.rpm
> > rpm -ivh ricci-0.9.3-2.fc6.x86_64.rpm
> >
> >
> > /etc/cluster/cluster.conf
> > -------------------------
> > <?xml version="1.0"?>
> > <cluster alias="alpha_cluster" config_version="8"
> > name="alpha_cluster">
> >   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >   <clusternodes>
> >     <clusternode name="node1" nodeid="1" votes="1">
> >       <multicast addr="239.192.196.121" interface="eth1"/>
> >       <fence>
> >         <method name="1">
> >           <device name="nps1" port="1" switch="1"/>
> >         </method>
> >       </fence>
> >   </clusternode>
> >   <clusternode name="node2" nodeid="2" votes="1">
> >     <multicast addr="239.192.196.121" interface="eth0"/>
> >     <fence>
> >       <method name="1">
> >         <device name="nps1" port="2" switch="1"/>
> >       </method>
> >     </fence>
> >   </clusternode>
> >   <clusternode name="node3" nodeid="3" votes="1">
> >   <multicast addr="239.192.196.121" interface="eth2"/>
> >     <fence>
> >       <method name="1">
> >         <device name="nps1" port="3" switch="1"/>
> >       </method>
> >     </fence>
> >   </clusternode>
> > </clusternodes>
> > <cman>
> >   <multicast addr="239.192.196.121"/>
> > </cman>
> > <fencedevices>
> >   <fencedevice agent="fence_apc" ipaddr="10.1.1.123" login="root"
> > name="***" passwd="***"/>
> >   </fencedevices>
> >   <rm>
> >     <failoverdomains/>
> >     <resources/>
> >   </rm>
> > </cluster>
> >
> >
> > Commands
> > --------
> > # modprobe lock_dlm
> > # modprobe dlm
> > # mount -t configfs non /sys/kernel/config
> > # ccsd
> > # cman_tool join
> >
> >
> > /var/log/messages
> > -----------------
> > 1 Jul 2 14:50:16 node1 ccsd[22457]: Starting ccsd 2.0.18:
> > 2 Jul 2 14:50:16 node1 ccsd[22457]: Built: Oct 1 2006 17:18:46
> > 3 Jul 2 14:50:16 node1 ccsd[22457]: Copyright (C) Red Hat, Inc. 2004
> > All rights reserved.
> > 4 Jul 2 14:50:45 node1 ccsd[22457]: Unable to connect to cluster
> > infrastructure after 30 seconds.
> > 5 Jul 2 14:51:15 node1 ccsd[22457]: Unable to connect to cluster
> > infrastructure after 60 seconds.
> > 6 Jul 2 14:51:39 node1 ccsd[22457]: cluster.conf (cluster name =
> > alpha_cluster, version = 6) found.
> > 7 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive Service
> > RELEASE 'subrev 1204 version 0.80.1'
> > 8 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C) 2002-2006
> > MontaVista Software, Inc and contributors.
> > 9 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C) 2006 Red
> > Hat, Inc.
> > 10 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] No nodeid specified in
> > cluster.conf
> > 11 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Error reading CCS
> > info, cannot start
> > 12 Jul 2 14:51:41 node1 openais[22542]: [MAIN ]
> > 13 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive exiting
> > (-9).
> > 14 Jul 2 14:51:45 node1 ccsd[22457]: Unable to connect to cluster
> > infrastructure after 90 seconds.
> > 15 Jul 2 14:52:15 node1 ccsd[22457]: Unable to connect to cluster
> > infrastructure after 120 seconds.
> > 16 Jul 2 14:52:44 node1 ccsd[22457]: Stopping ccsd, SIGTERM received.
> >
> > Lines 1-6 are from running the "ccsd" command above.
> > Lines 7-13 are from running the "cman_tool join" command above.
> >
> > I also received the following error message:
> > cman not started: CCS does not have a nodeid for this node, run
> > 'ccs_tool addnodeids' to fix
> > cman_tool: aisexec daemon didn't start
> >
> > Yes I did try running the ccs_tool addnodeids. It did not help. Notice
> > in the cluster.conf the nodeids were already in place. Any pointers to
> > narrowing down my problem are appreciated.
> >
> > Thanks,
> > James
> >
> >
> >
> > ______________________________________________________________________
> > See what you???re getting into???before you go there. Check it out!
> > _______________________________________________
> > Openais mailing list
> > Openais at lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/openais
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Don't get caught with egg on your face. Play Chicktionary!? 
http://club.live.com/chicktionary.aspx?icid=chick_hotmailtextlink2



From bernard.chew at muvee.com  Mon Jul  9 15:53:59 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Mon, 9 Jul 2007 23:53:59 +0800
Subject: [Linux-cluster] CLVM
References: <229C73600EB0E54DA818AB599482BCE90181667F@shadowfax.sg.muvee.net>
	<F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <229C73600EB0E54DA818AB599482BCE901816870@shadowfax.sg.muvee.net>

Hi,

I should probably provide the info from "cman_tool nodes" as well (in
addition to the info below) so that the info is clearer...

[root at server2 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    3   M   server3 <- first node which started CLVMD
   3    1    3   M   server2

Regards,
Bernard

-------------------------

-----Original Message-----
From: Bernard Chew 
Sent: Monday, July 09, 2007 10:46 PM
To: 'linux clustering'
Subject: RE: [Linux-cluster] CLVM

Hi,

Thanks for the quick reply. Here is the info that I gathered;

--------------------------------------------
Node 1 (which started CLVMD successfully):

[root at server3]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,3,0
clvmd move use event 3
clvmd recover event 3 (first)
clvmd add nodes
clvmd total nodes 1
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 3 done
clvmd move flags 0,0,1 ids 0,3,3
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 3 finished
clvmd move flags 1,0,0 ids 3,3,3
clvmd move flags 0,1,0 ids 3,4,3
clvmd move use event 4
clvmd recover event 4
clvmd add node 3

[root at server3 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 update
U-4,1,3
[1 3]

--------------------------------------------
Node 2 (which just wait forever):

[root at server2 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes

[root at server2 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 join
S-6,20,2
[1 3]
--------------------------------------------

Thanks for any help,
Bernard

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Monday, July 09, 2007 8:29 PM
To: linux clustering
Subject: RE: [Linux-cluster] CLVM

What error do you get? 


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Saturday, July 07, 2007 3:57 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] CLVM

Hi,

I have problems starting clvmd in a second node (after starting it
successfully on the first node) of a newly created 4-nodes cluster; no
problems staring the service first time on any nodes. Running "cman_tool
services" will show that the second node which started clvmd is in a
"join" status while the first node show a "update" status. This remains
even after a long period of time.

Given that the directories (i.e. /var /usr / ) are created using the
default lvm manager during installation, and I install the
lvm2-cluster-2.02.06-7.0.RHEL4.x86_64.rpm subsequently as part of the
requirements to set up GFS, will this cause the clvmd not to start
properly? I have no problems starting ccsd, cman and fenced.

Thanks in advance,
Bernard Chew
IT Operations Engineer


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From bsd_daemon at msn.com  Mon Jul  9 19:18:07 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Mon, 09 Jul 2007 19:18:07 +0000
Subject: [Linux-cluster] Cluster.conf full manual
In-Reply-To: <160501c7c225$f191a120$0a01a8c0@office.nodex.ru>
Message-ID: <BAY105-F2937EE59BEB34AC2DF14F2E3060@phx.gbl>


hi, you have to the cluster.conf manual. (ccsd, ccs_tool, cluster.conf)

example

sh # man 3 cluster.conf
sh # man 5 ccs_tool

have a nice day..

>From: "Pavel D. Kuzin" <pk at nodex.ru>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: [Linux-cluster] Cluster.conf full manual
>Date: Mon, 9 Jul 2007 16:37:42 +0400
>
>Hello, All!
>
>Where i can read full manuals about cluster.conf ?
>
>--
>Pavel D.Kuzin
>System Administrator
>Nodex  ISP
>St. Petersburg, Russia
>pk at nodex.ru
>http://nodex.ru
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://im.live.com/messenger/im/home/?source=hmtextlinkjuly07



From jparsons at redhat.com  Mon Jul  9 19:24:37 2007
From: jparsons at redhat.com (James Parsons)
Date: Mon, 09 Jul 2007 15:24:37 -0400
Subject: [Linux-cluster] Cluster.conf full manual
In-Reply-To: <BAY105-F2937EE59BEB34AC2DF14F2E3060@phx.gbl>
References: <BAY105-F2937EE59BEB34AC2DF14F2E3060@phx.gbl>
Message-ID: <46928B75.1050309@redhat.com>

mehmet celik wrote:

>
> hi, you have to the cluster.conf manual. (ccsd, ccs_tool, cluster.conf)
>
> example
>
> sh # man 3 cluster.conf
> sh # man 5 ccs_tool
>
> have a nice day.. 

There is actually one other resource available:
http://sources.redhat.com/cluster/doc/cluster_schema.html
It is not fully complete, but extends the man page above.
-j

>
>
>> From: "Pavel D. Kuzin" <pk at nodex.ru>
>> Reply-To: linux clustering <linux-cluster at redhat.com>
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Subject: [Linux-cluster] Cluster.conf full manual
>> Date: Mon, 9 Jul 2007 16:37:42 +0400
>>
>> Hello, All!
>>
>> Where i can read full manuals about cluster.conf ?
>>
>> -- 
>> Pavel D.Kuzin
>> System Administrator
>> Nodex  ISP
>> St. Petersburg, Russia
>> pk at nodex.ru
>> http://nodex.ru
>>
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> _________________________________________________________________
> http://im.live.com/messenger/im/home/?source=hmtextlinkjuly07
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster





From sdake at redhat.com  Mon Jul  9 20:11:39 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 09 Jul 2007 13:11:39 -0700
Subject: [Linux-cluster] RE: [Openais] Basic cluster not starting
In-Reply-To: <BLU105-W426996FA5B11EDEC4A7EAA80060@phx.gbl>
References: <BLU105-W426996FA5B11EDEC4A7EAA80060@phx.gbl>
Message-ID: <1184011899.21405.10.camel@shih.broked.org>

Explain crashes whole cluster?  Could you send cman_tool nodes after
fence but before the node restarts?  (ie: fence it then unplug its power
cord or use the power gui :)

Thanks
-steve


On Mon, 2007-07-09 at 12:47 -0400, james anderson wrote:
> Steve/Patrick,
>  
> Thanks for the replies :)
>  
> I found the following FC6 x86_64 updates and applied them to all 3
> nodes:
>   rpm -ivh xen-libs-3.0.3-9.fc6.x86_64.rpm
>   rpm -ivh --nodeps libvirt-0.2.3-1.fc6.x86_64.rpm
>   rpm -ivh bridge-utils-1.1-2.x86_64.rpm
>   rpm -ivh libvirt-python-0.2.3-1.fc6.x86_64.rpm
>   rpm -ivh python-virtinst-0.95.0-1.fc6.noarch.rpm
>   rpm -ivh xen-3.0.3-9.fc6.x86_64.rpm
>   rpm -Uvh cman-2.0.60-1.fc6.x86_64.rpm
> 
> After installing these I triple checked that the cluster.conf files
> are identical.  I then rebooted them all and restarted the cman
> service.  The good news is that the basic cluster now works!  The bad
> news: fencing a node crashes the whole cluster, also conga has some
> serious problems.  I will post those in seperate emails.
>  
> Just wanted to tie up this thread for anyone else encountering the
> same problem.  If anyone has had the same experience please post so my
> findings can be confirmed.
>  
> Cheers,
> James
> 
> 
> > Subject: Re: [Openais] Basic cluster not starting
> > From: sdake at redhat.com
> > To: jamesanderson1 at hotmail.com
> > CC: openais at lists.linux-foundation.org; linux-cluster at redhat.com
> > Date: Sat, 7 Jul 2007 18:06:07 -0700
> > 
> > James,
> > 
> > Let me speak with Patrick Caulfield on this topic Monday.
> > 
> > I have not seen this before in any of our testing, but it is
> possible
> > someone else using RHCS has. I've also copied the linux-cluster
> list.
> > 
> > The problem appears to be, however, with something relating to ccs
> or
> > the startup order. The opennais code doesn't know anything about the
> > ccsd node ids or parsing of the xml configuration file. That work is
> > done by ccsd and cman.
> > 
> > Did you try the cman init script?
> > 
> > Regards
> > -steve
> > 
> > On Thu, 2007-07-05 at 14:21 -0400, james anderson wrote:
> > > I am attempting to install GFS on FC6 64bit using RPMs.
> > > Below you will find my config and steps taken to get a GFS cluster
> > > running.
> > > I am unclear if the problem is with OpenAIS or RHCS.
> > > 
> > > 
> > > FC6 64bit RPMs
> > > --------------
> > > rpm -ivh openais-0.80.1-3.x86_64.rpm
> > > rpm -ivh perl-Net-Telnet-3.03-5.noarch.rpm
> > > rpm -ivh cman-2.0.18-2.fc6.x86_64.rpm
> > > System config cluster
> > > rpm -ivh system-config-cluster-1.0.29-1.0.noarch.rpm
> > > Luci
> > > rpm -ivh python-imaging-1.1.6-3.fc6.x86_64.rpm
> > > rpm -ivh zope-2.9.7-2.fc6.x86_64.rpm
> > > rpm -ivh plone-2.5.3-1.fc6.x86_64.rpm
> > > rpm -ivh luci-0.9.3-2.fc6.x86_64.rpm
> > > Ricci
> > > rpm -ivh --nodeps oddjob-libs-0.27-8.x86_64.rpm
> > > rpm -ivh oddjob-0.27-8.x86_64.rpm
> > > rpm -ivh modcluster-0.9.3-2.fc6.x86_64.rpm
> > > rpm -ivh ricci-0.9.3-2.fc6.x86_64.rpm
> > > 
> > > 
> > > /etc/cluster/cluster.conf
> > > -------------------------
> > > <?xml version="1.0"?>
> > > <cluster alias="alpha_cluster" config_version="8"
> > > name="alpha_cluster">
> > > <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> > > <clusternodes>
> > > <clusternode name="node1" nodeid="1" votes="1">
> > > <multicast addr="239.192.196.121" interface="eth1"/>
> > > <fence>
> > > <method name="1">
> > > <device name="nps1" port="1" switch="1"/>
> > > </method>
> > > </fence>
> > > </clusternode>
> > > <clusternode name="node2" nodeid="2" votes="1">
> > > <multicast addr="239.192.196.121" interface="eth0"/>
> > > <fence>
> > > <method name="1">
> > > <device name="nps1" port="2" switch="1"/>
> > > </method>
> > > </fence>
> > > </clusternode>
> > > <clusternode name="node3" nodeid="3" votes="1">
> > > <multicast addr="239.192.196.121" interface="eth2"/>
> > > <fence>
> > > <method name="1">
> > > <device name="nps1" port="3" switch="1"/>
> > > </method>
> > > </fence>
> > > </clusternode>
> > > </clusternodes>
> > > <cman>
> > > <multicast addr="239.192.196.121"/>
> > > </cman>
> > > <fencedevices>
> > > <fencedevice agent="fence_apc" ipaddr="10.1.1.123" login="root"
> > > name="***" passwd="***"/>
> > > </fencedevices>
> > > <rm>
> > > <failoverdomains/>
> > > <resources/>
> > > </rm>
> > > </cluster>
> > > 
> > > 
> > > Commands
> > > --------
> > > # modprobe lock_dlm
> > > # modprobe dlm
> > > # mount -t configfs non /sys/kernel/config
> > > # ccsd
> > > # cman_tool join
> > > 
> > > 
> > > /var/log/messages
> > > -----------------
> > > 1 Jul 2 14:50:16 node1 ccsd[22457]: Starting ccsd 2.0.18:
> > > 2 Jul 2 14:50:16 node1 ccsd[22457]: Built: Oct 1 2006 17:18:46
> > > 3 Jul 2 14:50:16 node1 ccsd[22457]: Copyright (C) Red Hat, Inc.
> 2004
> > > All rights reserved.
> > > 4 Jul 2 14:50:45 node1 ccsd[22457]: Unable to connect to cluster
> > > infrastructure after 30 seconds.
> > > 5 Jul 2 14:51:15 node1 ccsd[22457]: Unable to connect to cluster
> > > infrastructure after 60 seconds.
> > > 6 Jul 2 14:51:39 node1 ccsd[22457]: cluster.conf (cluster name =
> > > alpha_cluster, version = 6) found.
> > > 7 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive
> Service
> > > RELEASE 'subrev 1204 version 0.80.1'
> > > 8 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C)
> 2002-2006
> > > MontaVista Software, Inc and contributors.
> > > 9 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C) 2006
> Red
> > > Hat, Inc.
> > > 10 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] No nodeid
> specified in
> > > cluster.conf
> > > 11 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Error reading CCS
> > > info, cannot start
> > > 12 Jul 2 14:51:41 node1 openais[22542]: [MAIN ]
> > > 13 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive
> exiting
> > > (-9).
> > > 14 Jul 2 14:51:45 node1 ccsd[22457]: Unable to connect to cluster
> > > infrastructure after 90 seconds.
> > > 15 Jul 2 14:52:15 node1 ccsd[22457]: Unable to connect to cluster
> > > infrastructure after 120 seconds.
> > > 16 Jul 2 14:52:44 node1 ccsd[22457]: Stopping ccsd, SIGTERM
> received.
> > > 
> > > Lines 1-6 are from running the "ccsd" command above.
> > > Lines 7-13 are from running the "cman_tool join" command above.
> > > 
> > > I also received the following error message:
> > > cman not started: CCS does not have a nodeid for this node, run
> > > 'ccs_tool addnodeids' to fix
> > > cman_tool: aisexec daemon didn't start
> > > 
> > > Yes I did try running the ccs_tool addnodeids. It did not help.
> Notice
> > > in the cluster.conf the nodeids were already in place. Any
> pointers to
> > > narrowing down my problem are appreciated.
> > > 
> > > Thanks,
> > > James
> > > 
> > > 
> > > 
> > >
> ______________________________________________________________________
> > > See what you?re getting into?before you go there. Check it out!
> > > _______________________________________________
> > > Openais mailing list
> > > Openais at lists.linux-foundation.org
> > > https://lists.linux-foundation.org/mailman/listinfo/openais
> > 
> 
> 
> 
> ______________________________________________________________________
> Missed the show?  Watch videos of the Live Earth Concert on MSN. See
> them now!



From rabeeh at marvell.com  Mon Jul  9 21:16:07 2007
From: rabeeh at marvell.com (Rabeeh Khoury)
Date: Tue, 10 Jul 2007 00:16:07 +0300
Subject: [Linux-cluster] GFS2 for multimedia
Message-ID: <B9FFC3F97441D04093A504CEA31B7C410191CF18@msilexch01.marvell.com>

Hi All,

I would like to consult with you guys the usage GFS2 for home multimedia
usage.

Clearly GFS2 is intended for enterprise system and the following is not
where it was meant to be. But for the sake of the discussion -

The systems are 4 ARM based system every one running at 200MHz (low-end
systems) where on the master system a local SATA HDD /dev/sda and the
other 3 are connected to the master system via iSCSI running all GFS2
with DLM using Ethernet.

The details are as following -
1. Master system writes concurrently to 8 files and has 8 (files) x
10Mbps = 80Mbps (Mega bit per second) files write and 1 x 40Mbps file
read
2. The slaves has 3 (systems) x 1 (file per system) x 40Mbps = 120Mbps
total file read from the master
3. Every I/O to the files are all with O-direct, bypassing Linux cache
buffer but GFS2 mounted to cache D-entry etc...
4. There might be cases of single writer and multiple readers, and cases
of write only.
5. The slaves and master are in consumer home environment where shutting
down (gracefully or by force) is a common scenario.
6. Assume ARM CPUs have enough headroom for handling the required
workload, especially the master system.

The questions -
1. What are the risks and things can be missed in such environment?
2. Any idea if someone tried this before?
3. Other thoughts, suggestions, recommendations and ideas?

Regards,
Rabeeh




From lhh at redhat.com  Mon Jul  9 21:23:31 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 9 Jul 2007 17:23:31 -0400
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <54098.10.11.12.13.1183990000.squirrel@10.11.12.13>
References: <468E535F.2080606@algitech.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<54098.10.11.12.13.1183990000.squirrel@10.11.12.13>
Message-ID: <20070709212329.GA18076@redhat.com>

On Mon, Jul 09, 2007 at 04:06:40PM +0200, dan.deshayes at algitech.com wrote:
> Hi,
> thx for the reply but I'm not sure thats my problem.
> I couldn't find the syntax for disabling the exclusivity (I'm not using gui)
> but as far as I've understood its disabled by default. I tried with
> exclusive="0" (not sure if its the right syntax though) but didn't solve
> my problem.
> But if the cluster was running with exclusive-mode the relocation
> shouldn't work either, right?
> As stated earlier the service restarts fine aslong as the node already
> have an external ip.
> Anyone with other ideas. maybe related to the "IP monitor failing
> periodically"? but I don't have any problems running the cluster aslong as
> the bond0 interface goes down, so maybe not.

I haven't figured out the cause here, but disabling the 'ping' test
seems to fix it.

(edit ip.sh and change the 'ping' command to /bin/true or whatever)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From rwahyudi+linuxcluster at gmail.com  Tue Jul 10 00:58:03 2007
From: rwahyudi+linuxcluster at gmail.com (R Wahyudi)
Date: Tue, 10 Jul 2007 10:58:03 +1000
Subject: [Linux-cluster] GFS slower than NFS ???
Message-ID: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>

Hi All,


Before : We have 2 mail storage system which is shared using NFS over 100MB
Ethernet.
50% of user data divided equally on each storage server, and each storage
server NFS-mount the other
storage server so that it can provide  100% of data.
A number of SMTP,POP, and IMAP servers mounting the 2 storage servers using
NFS.

After:
We consolidated the storage server using  HP StorageWorks8100 EVA,
and we have 2 POP/IMAP server which mount the disk from the StorageWorks via
2GB Fiber - iSCSI.
These 2 server are GFS clustered.


To my disappointment's, the "After" setup was slower than the before.
Doing "ls -lah" on a directory with 300+ files take an average of 25
seconds,
while it took less than 1 second on previous setup.

I am quite new to GFS, and I am sure that this is something to do with my
GFS setup.
Has anyone else experience the same thing ?


Regards,
Rianto Wahyudi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070710/0c6502b9/attachment.htm>

From wcheng at redhat.com  Tue Jul 10 02:12:55 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Mon, 09 Jul 2007 22:12:55 -0400
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>
Message-ID: <4692EB27.50203@redhat.com>

R Wahyudi wrote:

> Before : We have 2 mail storage system which is shared using NFS over 
> 100MB Ethernet.
> 50% of user data divided equally on each storage server, and each 
> storage server NFS-mount the other
> storage server so that it can provide  100% of data.
> A number of SMTP,POP, and IMAP servers mounting the 2 storage servers 
> using NFS.
>
> After:
> We consolidated the storage server using  HP StorageWorks8100 EVA,
> and we have 2 POP/IMAP server which mount the disk from the 
> StorageWorks via 2GB Fiber - iSCSI.
> These 2 server are GFS clustered.
>
> To my disappointment's, the "After" setup was slower than the before.
> Doing "ls -lah" on a directory with 300+ files take an average of 25 
> seconds,
> while it took less than 1 second on previous setup.


The "ls -la" command is known to be a performance killer for cluster 
filesystems like GFS. It is not an GFS specific issues (a google search 
for POSIX "statlite" and "readdirplus" should give you plenty of 
examples). In general, we would like to

1. Caution users whether "ls -la" is really a good performance indicator 
for their applications.
2. Avoid having one gigantic directory holding many many small files. 
Re-structuring them into different sub-directories should see sizable 
performance improvement.

-- Wendy



From rwahyudi+linuxcluster at gmail.com  Tue Jul 10 02:55:39 2007
From: rwahyudi+linuxcluster at gmail.com (R Wahyudi)
Date: Tue, 10 Jul 2007 12:55:39 +1000
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <4692EB27.50203@redhat.com>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>
	<4692EB27.50203@redhat.com>
Message-ID: <9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>

>
>
> The "ls -la" command is known to be a performance killer for cluster
> filesystems like GFS. It is not an GFS specific issues (a google search
> for POSIX "statlite" and "readdirplus" should give you plenty of
> examples). In general, we would like to
>
> 1. Caution users whether "ls -la" is really a good performance indicator
> for their applications.
> 2. Avoid having one gigantic directory holding many many small files.
> Re-structuring them into different sub-directories should see sizable
> performance improvement.
>
> -- Wendy
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



Hi Wendy,

Thanks for your comment.
If this is the case then .. GFS or clustered storage is not the "ideal"
solutions for storage server that use Maildir ?
- Most of the time POP/IMAP jobs is to stat directory
- And users can have large number of email in a directory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070710/d7fc9dac/attachment.htm>

From wcheng at redhat.com  Tue Jul 10 04:12:44 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Tue, 10 Jul 2007 00:12:44 -0400
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>	
	<4692EB27.50203@redhat.com>
	<9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>
Message-ID: <4693073C.3010205@redhat.com>

R Wahyudi wrote:

> Thanks for your comment.
> If this is the case then .. GFS or clustered storage is not the 
> "ideal" solutions for storage server that use Maildir ?
> - Most of the time POP/IMAP jobs is to stat directory
> - And users can have large number of email in a directory


Yes, "stat()" and "readdir()" system calls are expensive. I personally 
don't know much about mail servers though - so really can't pass down 
good tuning tips here.

-- Wendy
 



From Arne.Brieseneck at vodafone.com  Tue Jul 10 07:15:36 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Tue, 10 Jul 2007 09:15:36 +0200
Subject: [Linux-cluster] GFS2 for multimedia
In-Reply-To: <B9FFC3F97441D04093A504CEA31B7C410191CF18@msilexch01.marvell.com>
Message-ID: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com>

Hi Rabeeh,

In your case I would not go for GFS2. What you describe sounds like a
NFS solution to me. 
You don't need the overhead of a cluster filesystem when only one system
is directly connected to the disk(s)

On the other hand it would be much easier to implement.


Best Regards,
Arne

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rabeeh Khoury
Sent: Montag, 9. Juli 2007 23:16
To: linux-cluster at redhat.com
Subject: [Linux-cluster] GFS2 for multimedia

Hi All,

I would like to consult with you guys the usage GFS2 for home multimedia
usage.

Clearly GFS2 is intended for enterprise system and the following is not
where it was meant to be. But for the sake of the discussion -

The systems are 4 ARM based system every one running at 200MHz (low-end
systems) where on the master system a local SATA HDD /dev/sda and the
other 3 are connected to the master system via iSCSI running all GFS2
with DLM using Ethernet.

The details are as following -
1. Master system writes concurrently to 8 files and has 8 (files) x
10Mbps = 80Mbps (Mega bit per second) files write and 1 x 40Mbps file
read 2. The slaves has 3 (systems) x 1 (file per system) x 40Mbps =
120Mbps total file read from the master 3. Every I/O to the files are
all with O-direct, bypassing Linux cache buffer but GFS2 mounted to
cache D-entry etc...
4. There might be cases of single writer and multiple readers, and cases
of write only.
5. The slaves and master are in consumer home environment where shutting
down (gracefully or by force) is a common scenario.
6. Assume ARM CPUs have enough headroom for handling the required
workload, especially the master system.

The questions -
1. What are the risks and things can be missed in such environment?
2. Any idea if someone tried this before?
3. Other thoughts, suggestions, recommendations and ideas?

Regards,
Rabeeh


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From rainer at ultra-secure.de  Tue Jul 10 07:47:13 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Tue, 10 Jul 2007 09:47:13 +0200
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>	<4692EB27.50203@redhat.com>
	<9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>
Message-ID: <46933981.2050104@ultra-secure.de>

R Wahyudi wrote:
>
> Hi Wendy,
>
> Thanks for your comment.
> If this is the case then .. GFS or clustered storage is not the
> "ideal" solutions for storage server that use Maildir ?


I do think so.

> - Most of the time POP/IMAP jobs is to stat directory
> - And users can have large number of email in a directory


What versions of RHEL/GFS are you using?

GFS is supposed to have a smaller overhead, compared to NFS.
However, I'm not sure this pays out in case a maildir-mailstorage is
clustered.
I've mentioned this before: in case of qmail as MTA, qmail itself goes
to great lengths to avoid any filename- and locking-collisions in the
maildir - it doesn't need any kind of lock-manager (GULM/DLM).
I suppose, it turns out to be counter-productive.
I don't consider "NFS" to be ideal - anectodical evidence suggests that
NFS is also very sub-optimal. It may just turn out, that it's the
lesser-evil.

I would be really interested in seeing head-to-head hard data evidence
comparing a NFS-setup (with Solaris or FreeBSD) and a GFS-setup with
several runs of postal (http://www.coker.com.au/postal/), just for kicks.
Unfortunately, I don't have the time (nor exactly the resources to do it
myself).




cheers,
Rainer



From Santosh.Panigrahi at in.unisys.com  Tue Jul 10 12:03:37 2007
From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar)
Date: Tue, 10 Jul 2007 17:33:37 +0530
Subject: [Linux-cluster] doubts in Piranha
Message-ID: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>

Hi,

 

I want to know, whether piranha supports configuring the following LVS
cluster.

 

1) Direct routing

2) Nat routing

3) Tunneling

 

Actually, I am configuring a NAT LVS cluster in piranha with one LVS
router and one Real Server. (For LVS testing purpose)

I am doing it in RHEL5 with piranha-0.8.4-7.el5.

 

LVS Router: 2 NIC cards (eth0: 10.1.40.45 - private interface, eth1:
11.1.40.45- public interface, eth1:0: 11.1.40.40- Virtual Public IP)

Real server:  1 NIC card (eth0: 10.1.40.34)

 

My piranha configuration is as follows.

 

Virtual server Tab:

 

 

Global Settings Tab:

 

 

I am configuring it to open a telnet session from a client PC, which is
not happening at all. My guess is piranha is not updating the IPVSADM
table properly as per the configuration or I have configured it wrongly.
Can somebody help me in this regard? 

 

Thanks and Regards,

Santosh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070710/815b15aa/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 47095 bytes
Desc: image001.jpg
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070710/815b15aa/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.jpg
Type: image/jpeg
Size: 39871 bytes
Desc: image004.jpg
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070710/815b15aa/attachment-0001.jpg>

From pk at nodex.ru  Tue Jul 10 14:16:55 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Tue, 10 Jul 2007 18:16:55 +0400
Subject: [Linux-cluster] Strange bug
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com>
Message-ID: <1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru>

This happens when i trying to mount gfs

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000018
 printing eip:
c0170dcd
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: thermal fan button processor ac battery crc32c libcrc32c iscsi_tcp libiscsi scsi_transport_iscsi ipv6 dm_snapshot 
dm_mirror lock_dlm dlm gfs lock_harness cman dm_mod loop ide_cd cdrom sd_mod generic mptspi mptscsih mptbase scsi_transport_spi 
evdev psmouse piix scsi_mod ehci_hcd parport_pc parport i2c_i801 i2c_core serio_raw ide_core floppy pcspkr e752x_edac edac_mc 
uhci_hcd shpchp pci_hotplug
CPU:    0
EIP:    0060:[<c0170dcd>]    Tainted: GF     VLI
EFLAGS: 00010293   (2.6.18 #1)
EIP is at do_add_mount+0x64/0xfb
eax: 0000000c   ebx: f69c6400   ecx: 00000003   edx: f7c0a9c0
esi: f0a77f30   edi: f3380003   ebp: 00000000   esp: f0a77e10
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 3245, ti=f0a76000 task=f6010000 task.ti=f0a76000)
Stack: 00000003 f0dcf000 00000003 f3380003 00000000 c01719c0 00000000 f69c6400
       f3380000 f06d5000 00000000 00000000 c17061a0 04ff7e17 00000009 f0857006
       f7c0a4c0 fffffffe f7c0a4c0 c0170c3e fffffffe f0a77f0c c0166ec1 f0857000
Call Trace:
 [<c01719c0>] do_mount+0x5f8/0x648
 [<c0170c3e>] mntput_no_expire+0x11/0x6a
 [<c0166ec1>] link_path_walk+0xb3/0xbd
 [<c0193fb7>] nfs_clear_inode+0x31/0x54
 [<c016758b>] do_unlinkat+0xba/0x113
 [<c0167220>] do_path_lookup+0x20a/0x225
 [<c013009c>] hrtimer_run_queues+0xcf/0x157
 [<c014541f>] get_page_from_freelist+0x9b/0x368
 [<c0121514>] __do_softirq+0x5a/0xbb
 [<c0145986>] __get_free_pages+0x25/0x3e
 [<c01708a6>] copy_mount_options+0x26/0x109
 [<c0171a7d>] sys_mount+0x6d/0xaa
 [<c0102c0d>] sysenter_past_esp+0x56/0x79
Code: e0 ff ff 8b 00 8b 80 58 04 00 00 39 42 64 75 76 8b 43 14 39 42 14 75 10 8b 06 bf f0 ff ff ff 39 42 10 0f 84 86 00 00 00 8b 43 
10 <8b> 40 0c 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 6b 8b 04
EIP: [<c0170dcd>] do_add_mount+0x64/0xfb SS:ESP 0068:f0a77e10



--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru




From dan.deshayes at algitech.com  Tue Jul 10 14:50:44 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Tue, 10 Jul 2007 16:50:44 +0200
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <20070709212329.GA18076@redhat.com>
References: <468E535F.2080606@algitech.com>	<F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>	<54098.10.11.12.13.1183990000.squirrel@10.11.12.13>
	<20070709212329.GA18076@redhat.com>
Message-ID: <46939CC4.5050305@algitech.com>

Lon Hohberger wrote:
> On Mon, Jul 09, 2007 at 04:06:40PM +0200, dan.deshayes at algitech.com wrote:
>   
>> Hi,
>> thx for the reply but I'm not sure thats my problem.
>> I couldn't find the syntax for disabling the exclusivity (I'm not using gui)
>> but as far as I've understood its disabled by default. I tried with
>> exclusive="0" (not sure if its the right syntax though) but didn't solve
>> my problem.
>> But if the cluster was running with exclusive-mode the relocation
>> shouldn't work either, right?
>> As stated earlier the service restarts fine aslong as the node already
>> have an external ip.
>> Anyone with other ideas. maybe related to the "IP monitor failing
>> periodically"? but I don't have any problems running the cluster aslong as
>> the bond0 interface goes down, so maybe not.
>>     
>
> I haven't figured out the cause here, but disabling the 'ping' test
> seems to fix it.
>
> (edit ip.sh and change the 'ping' command to /bin/true or whatever)
>
>   
I'm afraid it didn't help much.
I changed the pingcmd in the function ping_check to /bin/true restarted 
the rgmanagers but didn't work.

Here is my full configuration: http://nangilima.se/cluster.conf

I can have the full cluster running without problem, when first starting

bit when i then try to restart it with 'clusvcadm -R' it says:
Jul 10 16:22:20 asl012 clurgmgrd[412]: <notice> Stopping service 
service:www-project1
Jul 10 16:22:31 asl012 clurgmgrd[412]: <notice> Service 
service:www-project1 is stopped
Jul 10 16:22:31 asl012 clurgmgrd[412]: <notice> Starting stopped service 
service:www-project1
Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> start on ip "<external 
ip 1>" returned 1 (generic error)
Jul 10 16:22:32 asl012 clurgmgrd[412]: <warning> #68: Failed to start 
service:www-project1; return value: 1
Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> Stopping service 
service:www-project1
Jul 10 16:22:32 asl012 clurgmgrd: [412]: <err> script:psql-db: stop of 
/etc/init.d/postgresql failed (returned 1)
Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> stop on script "psql-db" 
returned 1 (generic error)
Jul 10 16:22:32 asl012 clurgmgrd[412]: <crit> #12: RG 
service:www-project1 failed to stop; intervention required
Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> Service 
service:www-project1 is failed
Jul 10 16:22:32 asl012 clurgmgrd[412]: <crit> #13: Service 
service:www-project1 failed to stop cleanly

then i disable the service and enable it on node usl001-mgmnt which 
works fine (since it got net through its own ip and route)
Jul 10 16:25:18 usl001 clurgmgrd[30130]: <notice> Starting disabled 
service service:www-project1
Jul 10 16:25:18 usl001 avahi-daemon[3533]: Registering new address 
record for <external ip 1> on bond0.
Jul 10 16:25:22 usl001 clurgmgrd[30130]: <notice> Service 
service:www-project1 started

also relocating it to node usl002-mgmnt works and then back to 
usl001-mgmnt works.
But never back to asl012-mgmnt except when i manully puts back the ip 
and route.

I'm using bond0 interface configured the following:
DEVICE=bond0
USERCTL=no
ONBOOT=yes
BROADCAST=<broadcast>
NETWORK=<network>.32
NETMASK=255.255.255.224
IPADDR=<external ip 1>
GATEWAY=<gw ip>

with slave interfaces eth0 and eth3 like this:
DEVICE=eth0 /3
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

I can supply more info if anyone wants to give it a shot.
sorry for repeting my question but i'm closing a deadline and walking 
blind ;)

Regards, Dan



From rmaureira at solint.cl  Tue Jul 10 15:23:07 2007
From: rmaureira at solint.cl (Robinson Maureira Castillo)
Date: Tue, 10 Jul 2007 11:23:07 -0400
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <46933981.2050104@ultra-secure.de>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>	<4692EB27.50203@redhat.com>	<9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>
	<46933981.2050104@ultra-secure.de>
Message-ID: <4693A45B.1080002@solint.cl>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rainer Duffner wrote:
> 
> GFS is supposed to have a smaller overhead, compared to NFS.
> However, I'm not sure this pays out in case a maildir-mailstorage is
> clustered.

In my personal experience using GFS on RHEL4 vs NFS, is that GFS
outperforms NFS on a mail system, both using maildir and mbox style
mailboxes.

The email software we're using is CommuniGate Pro, which doesn't do any
locking at filesystem level.

Under heavy use, in a 5TB (split on 5 mountpoints) filesystem, we
experienced a drop on WIO from ~90% to 60% using maildir, and then to
~45% when we switched to mbox.

One important tip with GFS is to disable quota (noquota mount flag) if
you don't need it, it saves a good amount of resources.

Best regards,
Rob.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGk6Rbu+2kmA0sEb4RAvqnAJ4k6ae/Z8mBu18VADxCKD8j1aoyFwCfYzAc
IQiVYHZTil8mNLWBnbzu8bI=
=uESK
-----END PGP SIGNATURE-----



From rafael.ferreira at apollogrp.edu  Tue Jul 10 15:39:57 2007
From: rafael.ferreira at apollogrp.edu (Rafael Ferreira)
Date: Tue, 10 Jul 2007 08:39:57 -0700
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <4693A45B.1080002@solint.cl>
Message-ID: <B51FE005A459F34DB248A429752106D001B05EEF@AMSGEV41.apollogrp.edu>

I would stay away from blanket terms like GFS being slower than NFS or
vice versa. Undoubtedly, stat calls "should" perform better on NFS with
attribute caching enabled, but I would first go down the route out
insure that the iSCSI is pumping out as much io as it should before
getting too carried away on blaming GFS. 

Actually, more info on the setup would make it easier for us to make an
educated guess on what could be causing the problem, for instance, what
do you mean by "StorageWorks via 2GB Fiber - iSCSI" is that iscsi over a
fibre Ethernet device? That alone would be an odd setup.

- raf

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robinson Maureira
Castillo
Sent: Tuesday, July 10, 2007 8:23 AM
To: linux clustering
Subject: Re: [Linux-cluster] GFS slower than NFS ???

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rainer Duffner wrote:
> 
> GFS is supposed to have a smaller overhead, compared to NFS.
> However, I'm not sure this pays out in case a maildir-mailstorage is
> clustered.

In my personal experience using GFS on RHEL4 vs NFS, is that GFS
outperforms NFS on a mail system, both using maildir and mbox style
mailboxes.

The email software we're using is CommuniGate Pro, which doesn't do any
locking at filesystem level.

Under heavy use, in a 5TB (split on 5 mountpoints) filesystem, we
experienced a drop on WIO from ~90% to 60% using maildir, and then to
~45% when we switched to mbox.

One important tip with GFS is to disable quota (noquota mount flag) if
you don't need it, it saves a good amount of resources.

Best regards,
Rob.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGk6Rbu+2kmA0sEb4RAvqnAJ4k6ae/Z8mBu18VADxCKD8j1aoyFwCfYzAc
IQiVYHZTil8mNLWBnbzu8bI=
=uESK
-----END PGP SIGNATURE-----

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From rainer at ultra-secure.de  Tue Jul 10 16:02:48 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Tue, 10 Jul 2007 18:02:48 +0200
Subject: [Linux-cluster] GFS slower than NFS ???
In-Reply-To: <4693A45B.1080002@solint.cl>
References: <9173fd7e0707091758o3a73a370x8449d42ad5d65c0e@mail.gmail.com>	<4692EB27.50203@redhat.com>	<9173fd7e0707091955o8ba0ac6lb9680b1e8e8211d@mail.gmail.com>	<46933981.2050104@ultra-secure.de>
	<4693A45B.1080002@solint.cl>
Message-ID: <4693ADA8.6000803@ultra-secure.de>

Robinson Maureira Castillo wrote:
> Rainer Duffner wrote:
> > GFS is supposed to have a smaller overhead, compared to NFS.
> > However, I'm not sure this pays out in case a maildir-mailstorage is
> > clustered.
>
> In my personal experience using GFS on RHEL4 vs NFS, is that GFS
> outperforms NFS on a mail system, both using maildir and mbox style
> mailboxes.
>

I don't know how Communigate actually delivers mail.
For qmail, the problem is that it delivers into a "tmp"-directory in the
user's Maildir first, then moving  it to "new", and then moving it to
"cur", if it was seen by the MUA.

I must admit that I don't know about GFS6.1, but with GFS6, this is
awfully slow, because everytime the mail is touched, a lock needs to be
aquired, thus multiplying the I/O needed.

I would expect a speed-up with RHEL4, but the principal problem remains,
IMO.


> The email software we're using is CommuniGate Pro, which doesn't do any
> locking at filesystem level.
>

It doesn't need to - DLM will do that. Qmail also doesn't care about
locking - it's from the old days, when locking didn't work anyway, so it
was built to work around this problem ;-)
But DLM (and GULM anyway) doesn't know that - it makes sure that no
other process will write to the file at the same time (which doesn't
happen anyway, qmail creates unique filenames for every mail), thus
wasting a lot of I/O (and slowing deliveries into the same maildir).

That, or somebody correct my assumptions....



cheers,
Rainer



From pk at nodex.ru  Tue Jul 10 17:06:21 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Tue, 10 Jul 2007 21:06:21 +0400
Subject: [Linux-cluster] rgmanager on debian
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com>
	<1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru>
Message-ID: <1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru>

I`m trying to use rgmanager on debian.
and see strange trouble.
can you help me?

i`m trying to start httpd with rgmanager and see this in logs 
But gfs volume mount os ok.
I`m trying to understand source of  error=32 but i can`t....

Jul 10 21:03:58 node1 clurgmgrd[3652]: <notice> Starting stopped service hosting-httpd
Jul 10 21:03:58 node1 kernel: GFS: Trying to join cluster "lock_dlm", "NODEX:Hosting_data"
Jul 10 21:04:00 node1 kernel: GFS: fsid=NODEX:Hosting_data.0: Joined cluster. Now mounting FS...
Jul 10 21:04:00 node1 kernel: GFS: fsid=NODEX:Hosting_data.0: jid=0: Trying to acquire journal lock...
Jul 10 21:04:00 node1 kernel: GFS: fsid=NODEX:Hosting_data.0: jid=0: Looking at journal...
Jul 10 21:04:00 node1 kernel: GFS: fsid=NODEX:Hosting_data.0: jid=0: Done
Jul 10 21:04:01 node1 clurgmgrd: [3652]: <err> 'mount -t gfs  /dev/mapper/Hosting-TMP_Data /tmp' failed, error=32
Jul 10 21:04:01 node1 clurgmgrd[3652]: <notice> start on clusterfs:TMP_Data returned 2 (invalid argument(s))
Jul 10 21:04:01 node1 clurgmgrd[3652]: <warning> #68: Failed to start hosting-httpd; return value: 1
Jul 10 21:04:01 node1 clurgmgrd[3652]: <notice> Stopping service hosting-httpd
Jul 10 21:04:01 node1 clurgmgrd: [3652]: <info> Executing /etc/init.d/httpd stop
Jul 10 21:04:01 node1 clurgmgrd: [3652]: <info> unmounting /dev/mapper/Hosting-Hosting_data (/usr/hosting)
Jul 10 21:04:03 node1 clurgmgrd: [3652]: <info> /dev/mapper/Hosting-TMP_Data is not mounted
Jul 10 21:04:05 node1 clurgmgrd[3652]: <notice> Service hosting-httpd is recovering
Jul 10 21:04:05 node1 clurgmgrd[3652]: <warning> #71: Relocating failed service hosting-httpd
Jul 10 21:04:06 node2 clurgmgrd[7314]: <notice> Recovering failed service hosting-httpd
Jul 10 21:04:06 node2 clurgmgrd: [7314]: <err> 'mount -t gfs  /dev/mapper/Hosting-Hosting_data /usr/hosting' failed, error=32
Jul 10 21:04:06 node2 clurgmgrd[7314]: <notice> start on clusterfs:Hosting_data returned 2 (invalid argument(s))
Jul 10 21:04:06 node2 clurgmgrd[7314]: <warning> #68: Failed to start hosting-httpd; return value: 1
Jul 10 21:04:06 node2 clurgmgrd[7314]: <notice> Stopping service hosting-httpd
Jul 10 21:04:06 node2 clurgmgrd: [7314]: <info> Executing /etc/init.d/httpd stop
Jul 10 21:04:06 node2 clurgmgrd: [7314]: <info> /dev/mapper/Hosting-Hosting_data is not mounted
Jul 10 21:04:08 node2 clurgmgrd: [7314]: <info> /dev/mapper/Hosting-TMP_Data is not mounted
Jul 10 21:04:10 node2 clurgmgrd[7314]: <notice> Service hosting-httpd is recovering
Jul 10 21:04:10 node1 clurgmgrd[3652]: <notice> Stopping service hosting-httpd
Jul 10 21:04:10 node1 clurgmgrd: [3652]: <info> Executing /etc/init.d/httpd stop
Jul 10 21:04:11 node1 clurgmgrd: [3652]: <info> /dev/mapper/Hosting-Hosting_data is not mounted
Jul 10 21:04:13 node1 clurgmgrd: [3652]: <info> /dev/mapper/Hosting-TMP_Data is not mounted
Jul 10 21:04:15 node1 clurgmgrd[3652]: <notice> Service hosting-httpd is stopped

    
--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru




From jprats at cesca.es  Tue Jul 10 19:45:07 2007
From: jprats at cesca.es (Jordi Prats)
Date: Tue, 10 Jul 2007 21:45:07 +0200
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
Message-ID: <4693E1C3.5010604@cesca.es>

Maybe it's something related to the checking script? Could you paste
here your /etc/sysconfig/ha/lvs.cf?

regards,
Jordi

Panigrahi, Santosh Kumar wrote:
> Hi,
> 
>  
> 
> I want to know, whether piranha supports configuring the following LVS
> cluster.
> 
>  
> 
> 1) Direct routing
> 
> 2) Nat routing
> 
> 3) Tunneling
> 
>  
> 
> Actually, I am configuring a NAT LVS cluster in piranha with one LVS
> router and one Real Server. (For LVS testing purpose)
> 
> I am doing it in RHEL5 with piranha-0.8.4-7.el5.
> 
>  
> 
> *LVS Router:* 2 NIC cards (eth0: 10.1.40.45* *- private interface,
> *eth1: 11.1.40.45- public interface, eth1:0: 11.1.40.40- Virtual Public
> IP*)**
> 
> *Real server:*  1 NIC card (eth0: 10.1.40.34)
> 
>  
> 
> My piranha configuration is as follows.
> 
>  
> 
> *Virtual server Tab:*
> 
>  
> 
> *Global Settings Tab:*
> 
> **
> 
> * *
> 
> I am configuring it to open a telnet session from a client PC, which is
> not happening at all. My guess is piranha is not updating the IPVSADM
> table properly as per the configuration or I have configured it wrongly.
> Can somebody help me in this regard?
> 
>  
> 
> Thanks and Regards,
> 
> Santosh
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
......................................................................
        __
       / /          Jordi Prats Catal?
 C E / S / C A      Departament de Sistemes
     /_/            Centre de Supercomputaci? de Catalunya

 Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
 T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
......................................................................
pgp:0x5D0D1321
......................................................................



From sdake at redhat.com  Tue Jul 10 22:12:43 2007
From: sdake at redhat.com (Steven Dake)
Date: Tue, 10 Jul 2007 15:12:43 -0700
Subject: [Linux-cluster] RE: [Openais] Basic cluster not starting
In-Reply-To: <BLU105-W562BEAB861E9000A34E9280050@phx.gbl>
References: <BLU105-W562BEAB861E9000A34E9280050@phx.gbl>
Message-ID: <1184105563.3527.2.camel@balance>

I would be very appreciative if you could try a test RPM of openais for
me to see if it resolves your problem.

If your willing please let me know what your architecture is and I'll
build you one.

Regards
-steve
On Tue, 2007-07-10 at 18:05 -0400, james anderson wrote:
> Steve/Paul,
>  
> I am not sure why, but my emails to the linux-cluster forum have been
> getting eaten?!
>  
> In short when node 3 is shutdown the other 2 nodes lose quorum with
> each other. This seems wrong. Any ideas?
>  
> *** Steady state cluster happy***
> [root at node1 ~]# cman_tool nodes
> Node Sts Inc Joined Name
> 1 M 704 2007-07-10 13:47:51 node1
> 2 M 708 2007-07-10 13:52:54 node2
> 3 M 708 2007-07-10 13:52:54 node3
>  
> *** node 3 shutdown ***
> [root at node1 ~]# cman_tool nodes
> Node Sts Inc Joined Name
> 1 M 704 2007-07-10 13:47:51 node1
> 2 X 708 node2
> 3 X 708 node3
>  
> *** Time elapsed node 3 still down ***
> [root at node1 ~]# cman_tool nodes
> NOTE: There are 1 disallowed nodes,
> members list may seem inconsistent across the cluster
> Node Sts Inc Joined Name
> 1 M 704 2007-07-10 13:47:51 node1
> 2 d 708 2007-07-10 13:52:54 node2
> 3 X 708 node3
>  
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.20)
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.20)
> Jul 10 13:52:54 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 13:52:54 node2 openais[3136]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 13:52:54 node2 openais[3136]: [CMAN ] quorum regained, resuming
> activity
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.18
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.19
> Jul 10 13:52:54 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.20
> Jul 10 13:52:54 node2 openais[3136]: [CPG ] got joinlist message from
> node 1
> Jul 10 13:52:54 node2 openais[3136]: [CPG ] got joinlist message from
> node 2
> Jul 10 13:52:54 node2 openais[3136]: [CPG ] got joinlist message from
> node 3
> Jul 10 13:59:28 node2 openais[3136]: [TOTEM] The token was lost in the
> OPERATIONAL state.
> Jul 10 13:59:28 node2 openais[3136]: [TOTEM] Receive multicast socket
> recv buffer size (262142 bytes).
> Jul 10 13:59:28 node2 openais[3136]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Jul 10 13:59:28 node2 openais[3136]: [TOTEM] entering GATHER state
> from 2.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] entering GATHER state
> from 0.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] Creating commit token
> because I am the rep.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] Saving state aru 21 high
> seq received 21
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] Storing new sequence id
> for ring 2c8
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] entering COMMIT state.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] entering RECOVERY state.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] position [0] member
> 10.1.1.19:
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] previous ring seq 708 rep
> 10.1.1.18
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] aru 21 high delivered 21
> received flag 0
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] Sending initial ORF token
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] no interface found for
> nodeid
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] no interface found for
> nodeid
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 13:59:32 node2 openais[3136]: [CMAN ] quorum lost, blocking
> activity
> Jul 10 13:59:32 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 13:59:32 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 13:59:32 node2 openais[3136]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 13:59:32 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.19
> Jul 10 13:59:32 node2 openais[3136]: [CPG ] got joinlist message from
> node 2
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] entering GATHER state
> from 9.
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] Saving state aru b high
> seq received b
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] Storing new sequence id
> for ring 2cc
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] entering COMMIT state.
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] entering RECOVERY state.
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] position [0] member
> 10.1.1.18:
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] previous ring seq 712 rep
> 10.1.1.18
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] position [1] member
> 10.1.1.19:
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] previous ring seq 712 rep
> 10.1.1.19
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 14:02:54 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:02:54 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:02:54 node2 openais[3136]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 14:02:54 node2 openais[3136]: [MAIN ] Node node1 not joined to
> cman because it has rejoined an inquorate cluster
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.18
> Jul 10 14:02:54 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.19
> Jul 10 14:02:54 node2 openais[3136]: [CPG ] got joinlist message from
> node 1
> Jul 10 14:02:54 node2 openais[3136]: [CPG ] got joinlist message from
> node 2
>  
> *** node 3 back up ***
> [root at node1 init.d]# cman_tool nodes
> Node Sts Inc Joined Name
> 1 M 704 2007-07-10 13:47:51 node1
> 2 X 708 node2
> 3 X 708 node3
>  
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] The consensus timeout
> expired.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] entering GATHER state
> from 0.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] entering GATHER state
> from 3.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] Creating commit token
> because I am the rep.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] Saving state aru 16 high
> seq received 16
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] Storing new sequence id
> for ring 2d0
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] entering COMMIT state.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] entering RECOVERY state.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] position [0] member
> 10.1.1.19:
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] previous ring seq 716 rep
> 10.1.1.18
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] aru 16 high delivered 16
> received flag 0
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] Sending initial ORF token
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] no interface found for
> nodeid
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 14:13:09 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] Members Joined:
> Jul 10 14:13:09 node2 openais[3136]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:13:09 node2 openais[3136]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 14:13:09 node2 openais[3136]: [CLM ] got nodejoin message
> 10.1.1.19
> Jul 10 14:13:09 node2 openais[3136]: [CPG ] got joinlist message from
> node 2
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] entering GATHER state
> from 9.
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] entering GATHER state
> from 11.
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] Saving state aru b high
> seq received b
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] Storing new sequence id
> for ring 2d4
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] entering COMMIT state.
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] entering RECOVERY state.
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] position [0] member
> 10.1.1.18:
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] previous ring seq 720 rep
> 10.1.1.18
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] position [1] member
> 10.1.1.19:
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] previous ring seq 720 rep
> 10.1.1.19
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] position [2] member
> 10.1.1.20:
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] previous ring seq 720 rep
> 10.1.1.20
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:22:54 node2 openais[3136]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 10 14:22:54 node2 openais[3136]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:22:54 node2 openais[3136]: [CLM ] New Configuration:
> Jul 10 14:22:54 node2 openais[3136]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:22:54 node2 openais[3136]: [CLM ] Members Left:
> Jul 10 14:22:54 node2 gfs_controld[3164]: groupd_dispatch error -1
> errno 11
> Jul 10 14:22:54 node2 gfs_controld[3164]: groupd connection died
> Jul 10 14:22:54 node2 gfs_controld[3164]: cluster is down, exiting
> Jul 10 14:22:54 node2 dlm_controld[3158]: groupd is down, exiting
> Jul 10 14:23:20 node2 ccsd[3130]: Unable to connect to cluster
> infrastructure after 30 seconds.
> Jul 10 14:23:50 node2 ccsd[3130]: Unable to connect to cluster
> infrastructure after 60 seconds.
>  
> *** node2 cman crashed ***
> [root at node1 init.d]# cman_tool nodes
> Node Sts Inc Joined Name
> 1 M 704 2007-07-10 13:47:51 node1
> 2 X 708 node2
> 3 X 724 node3
>  
> [root at node2 init.d]# cman_tool nodes
> cman_tool: Cannot open connection to cman, is it running ?
>  
> [root at node3 init.d]# cman_tool nodes
> Node Sts Inc Joined Name
> 1 X 724 node1
> 2 X 724 node2
> 3 M 712 2007-07-10 14:07:13 node3
>  
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] New Configuration:
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] Members Left:
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] Members Joined:
> Jul 10 14:42:55 node1 openais[3166]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] New Configuration:
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.20)
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] Members Left:
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] Members Joined:
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.19)
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.20)
> Jul 10 14:42:55 node1 openais[3166]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:42:55 node1 openais[3166]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] got nodejoin message
> 10.1.1.18
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] got nodejoin message
> 10.1.1.19
> Jul 10 14:42:55 node1 openais[3166]: [CLM ] got nodejoin message
> 10.1.1.20
> Jul 10 14:42:55 node1 openais[3166]: [TOTEM] Retransmit List: c
> Jul 10 14:42:55 node1 openais[3166]: [TOTEM] Retransmit List: c
> Jul 10 14:42:55 node1 openais[3166]: [TOTEM] Retransmit List: c d
> Jul 10 14:43:01 node1 last message repeated 47 times
> Jul 10 14:43:21 node1 openais[3166]: [TOTEM] The token was lost in the
> OPERATIONAL state.
> Jul 10 14:43:21 node1 openais[3166]: [TOTEM] Receive multicast socket
> recv buffer size (262142 bytes).
> Jul 10 14:43:21 node1 openais[3166]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Jul 10 14:43:21 node1 openais[3166]: [TOTEM] entering GATHER state
> from 2.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] entering GATHER state
> from 0.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Creating commit token
> because I am the rep.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Saving state aru b high
> seq received d
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Storing new sequence id
> for ring 2ec
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] entering COMMIT state.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] entering RECOVERY state.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] position [0] member
> 10.1.1.18:
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] previous ring seq 744 rep
> 10.1.1.18
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] aru b high delivered b
> received flag 0
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] copying all old ring
> messages from c-d.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Originated 0 messages in
> RECOVERY.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Originated for recovery:
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Not Originated for
> recovery: c d
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] Sending initial ORF token
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] New Configuration:
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] Members Left:
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] no interface found for
> nodeid
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] no interface found for
> nodeid
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] Members Joined:
> Jul 10 14:43:26 node1 openais[3166]: [CMAN ] quorum lost, blocking
> activity
> Jul 10 14:43:26 node1 openais[3166]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] New Configuration:
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] r(0) ip(10.1.1.18)
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] Members Left:
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] Members Joined:
> Jul 10 14:43:26 node1 openais[3166]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 10 14:43:26 node1 openais[3166]: [TOTEM] entering OPERATIONAL
> state.
> Jul 10 14:43:26 node1 openais[3166]: [CLM ] got nodejoin message
> 10.1.1.18
> Jul 10 14:43:26 node1 openais[3166]: [CPG ] got joinlist message from
> node 1
>  
>  
>  
> Let me know what else I can do to narrow this problem down.
>  
> Thank you for the help :)
> James
> 
> 
> > Subject: RE: [Openais] Basic cluster not starting
> > From: sdake at redhat.com
> > To: jamesanderson1 at hotmail.com
> > CC: openais at lists.linux-foundation.org; linux-cluster at redhat.com
> > Date: Mon, 9 Jul 2007 13:11:39 -0700
> > 
> > Explain crashes whole cluster? Could you send cman_tool nodes after
> > fence but before the node restarts? (ie: fence it then unplug its
> power
> > cord or use the power gui :)
> > 
> > Thanks
> > -steve
> > 
> > 
> > On Mon, 2007-07-09 at 12:47 -0400, james anderson wrote:
> > > Steve/Patrick,
> > > 
> > > Thanks for the replies :)
> > > 
> > > I found the following FC6 x86_64 updates and applied them to all 3
> > > nodes:
> > > rpm -ivh xen-libs-3.0.3-9.fc6.x86_64.rpm
> > > rpm -ivh --nodeps libvirt-0.2.3-1.fc6.x86_64.rpm
> > > rpm -ivh bridge-utils-1.1-2.x86_64.rpm
> > > rpm -ivh libvirt-python-0.2.3-1.fc6.x86_64.rpm
> > > rpm -ivh python-virtinst-0.95.0-1.fc6.noarch.rpm
> > > rpm -ivh xen-3.0.3-9.fc6.x86_64.rpm
> > > rpm -Uvh cman-2.0.60-1.fc6.x86_64.rpm
> > > 
> > > After installing these I triple checked that the cluster.conf
> files
> > > are identical. I then rebooted them all and restarted the cman
> > > service. The good news is that the basic cluster now works! The
> bad
> > > news: fencing a node crashes the whole cluster, also conga has
> some
> > > serious problems. I will post those in seperate emails.
> > > 
> > > Just wanted to tie up this thread for anyone else encountering the
> > > same problem. If anyone has had the same experience please post so
> my
> > > findings can be confirmed.
> > > 
> > > Cheers,
> > > James
> > > 
> > > 
> > > > Subject: Re: [Openais] Basic cluster not starting
> > > > From: sdake at redhat.com
> > > > To: jamesanderson1 at hotmail.com
> > > > CC: openais at lists.linux-foundation.org; linux-cluster at redhat.com
> > > > Date: Sat, 7 Jul 2007 18:06:07 -0700
> > > > 
> > > > James,
> > > > 
> > > > Let me speak with Patrick Caulfield on this topic Monday.
> > > > 
> > > > I have not seen this before in any of our testing, but it is
> > > possible
> > > > someone else using RHCS has. I've also copied the linux-cluster
> > > list.
> > > > 
> > > > The problem appears to be, however, with something relating to
> ccs
> > > or
> > > > the startup order. The opennais code doesn't know anything about
> the
> > > > ccsd node ids or parsing of the xml configuration file. That
> work is
> > > > done by ccsd and cman.
> > > > 
> > > > Did you try the cman init script?
> > > > 
> > > > Regards
> > > > -steve
> > > > 
> > > > On Thu, 2007-07-05 at 14:21 -0400, james anderson wrote:
> > > > > I am attempting to install GFS on FC6 64bit using RPMs.
> > > > > Below you will find my config and steps taken to get a GFS
> cluster
> > > > > running.
> > > > > I am unclear if the problem is with OpenAIS or RHCS.
> > > > > 
> > > > > 
> > > > > FC6 64bit RPMs
> > > > > --------------
> > > > > rpm -ivh openais-0.80.1-3.x86_64.rpm
> > > > > rpm -ivh perl-Net-Telnet-3.03-5.noarch.rpm
> > > > > rpm -ivh cman-2.0.18-2.fc6.x86_64.rpm
> > > > > System config cluster
> > > > > rpm -ivh system-config-cluster-1.0.29-1.0.noarch.rpm
> > > > > Luci
> > > > > rpm -ivh python-imaging-1.1.6-3.fc6.x86_64.rpm
> > > > > rpm -ivh zope-2.9.7-2.fc6.x86_64.rpm
> > > > > rpm -ivh plone-2.5.3-1.fc6.x86_64.rpm
> > > > > rpm -ivh luci-0.9.3-2.fc6.x86_64.rpm
> > > > > Ricci
> > > > > rpm -ivh --nodeps oddjob-libs-0.27-8.x86_64.rpm
> > > > > rpm -ivh oddjob-0.27-8.x86_64.rpm
> > > > > rpm -ivh modcluster-0.9.3-2.fc6.x86_64.rpm
> > > > > rpm -ivh ricci-0.9.3-2.fc6.x86_64.rpm
> > > > > 
> > > > > 
> > > > > /etc/cluster/cluster.conf
> > > > > -------------------------
> > > > > <?xml version="1.0"?>
> > > > > <cluster alias="alpha_cluster" config_version="8"
> > > > > name="alpha_cluster">
> > > > > <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> > > > > <clusternodes>
> > > > > <clusternode name="node1" nodeid="1" votes="1">
> > > > > <multicast addr="239.192.196.121" interface="eth1"/>
> > > > > <fence>
> > > > > <method name="1">
> > > > > <device name="nps1" port="1" switch="1"/>
> > > > > </method>
> > > > > </fence>
> > > > > </clusternode>
> > > > > <clusternode name="node2" nodeid="2" votes="1">
> > > > > <multicast addr="239.192.196.121" interface="eth0"/>
> > > > > <fence>
> > > > > <method name="1">
> > > > > <device name="nps1" port="2" switch="1"/>
> > > > > </method>
> > > > > </fence>
> > > > > </clusternode>
> > > > > <clusternode name="node3" nodeid="3" votes="1">
> > > > > <multicast addr="239.192.196.121" interface="eth2"/>
> > > > > <fence>
> > > > > <method name="1">
> > > > > <device name="nps1" port="3" switch="1"/>
> > > > > </method>
> > > > > </fence>
> > > > > </clusternode>
> > > > > </clusternodes>
> > > > > <cman>
> > > > > <multicast addr="239.192.196.121"/>
> > > > > </cman>
> > > > > <fencedevices>
> > > > > <fencedevice agent="fence_apc" ipaddr="10.1.1.123"
> login="root"
> > > > > name="***" passwd="***"/>
> > > > > </fencedevices>
> > > > > <rm>
> > > > > <failoverdomains/>
> > > > > <resources/>
> > > > > </rm>
> > > > > </cluster>
> > > > > 
> > > > > 
> > > > > Commands
> > > > > --------
> > > > > # modprobe lock_dlm
> > > > > # modprobe dlm
> > > > > # mount -t configfs non /sys/kernel/config
> > > > > # ccsd
> > > > > # cman_tool join
> > > > > 
> > > > > 
> > > > > /var/log/messages
> > > > > -----------------
> > > > > 1 Jul 2 14:50:16 node1 ccsd[22457]: Starting ccsd 2.0.18:
> > > > > 2 Jul 2 14:50:16 node1 ccsd[22457]: Built: Oct 1 2006 17:18:46
> > > > > 3 Jul 2 14:50:16 node1 ccsd[22457]: Copyright (C) Red Hat,
> Inc.
> > > 2004
> > > > > All rights reserved.
> > > > > 4 Jul 2 14:50:45 node1 ccsd[22457]: Unable to connect to
> cluster
> > > > > infrastructure after 30 seconds.
> > > > > 5 Jul 2 14:51:15 node1 ccsd[22457]: Unable to connect to
> cluster
> > > > > infrastructure after 60 seconds.
> > > > > 6 Jul 2 14:51:39 node1 ccsd[22457]: cluster.conf (cluster name
> =
> > > > > alpha_cluster, version = 6) found.
> > > > > 7 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive
> > > Service
> > > > > RELEASE 'subrev 1204 version 0.80.1'
> > > > > 8 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C)
> > > 2002-2006
> > > > > MontaVista Software, Inc and contributors.
> > > > > 9 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Copyright (C)
> 2006
> > > Red
> > > > > Hat, Inc.
> > > > > 10 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] No nodeid
> > > specified in
> > > > > cluster.conf
> > > > > 11 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] Error reading
> CCS
> > > > > info, cannot start
> > > > > 12 Jul 2 14:51:41 node1 openais[22542]: [MAIN ]
> > > > > 13 Jul 2 14:51:41 node1 openais[22542]: [MAIN ] AIS Executive
> > > exiting
> > > > > (-9).
> > > > > 14 Jul 2 14:51:45 node1 ccsd[22457]: Unable to connect to
> cluster
> > > > > infrastructure after 90 seconds.
> > > > > 15 Jul 2 14:52:15 node1 ccsd[22457]: Unable to connect to
> cluster
> > > > > infrastructure after 120 seconds.
> > > > > 16 Jul 2 14:52:44 node1 ccsd[22457]: Stopping ccsd, SIGTERM
> > > received.
> > > > > 
> > > > > Lines 1-6 are from running the "ccsd" command above.
> > > > > Lines 7-13 are from running the "cman_tool join" command
> above.
> > > > > 
> > > > > I also received the following error message:
> > > > > cman not started: CCS does not have a nodeid for this node,
> run
> > > > > 'ccs_tool addnodeids' to fix
> > > > > cman_tool: aisexec daemon didn't start
> > > > > 
> > > > > Yes I did try running the ccs_tool addnodeids. It did not
> help.
> > > Notice
> > > > > in the cluster.conf the nodeids were already in place. Any
> > > pointers to
> > > > > narrowing down my problem are appreciated.
> > > > > 
> > > > > Thanks,
> > > > > James
> > > > > 
> > > > > 
> > > > > 
> > > > >
> > >
> ______________________________________________________________________
> > > > > See what you?re getting into?before you go there. Check it
> out!
> > > > > _______________________________________________
> > > > > Openais mailing list
> > > > > Openais at lists.linux-foundation.org
> > > > > https://lists.linux-foundation.org/mailman/listinfo/openais
> > > > 
> > > 
> > > 
> > > 
> > >
> ______________________________________________________________________
> > > Missed the show? Watch videos of the Live Earth Concert on MSN.
> See
> > > them now!
> > 
> 
> 
> 
> ______________________________________________________________________
> PC Magazine?s 2007 editors? choice for best web mail?award-winning
> Windows Live Hotmail. Check it out!



From lhh at redhat.com  Tue Jul 10 22:19:22 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 10 Jul 2007 18:19:22 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070706183658.GA24692@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
Message-ID: <20070710221922.GG18076@redhat.com>

On Fri, Jul 06, 2007 at 09:36:59PM +0300, Janne Peltonen wrote:
> On Fri, Jul 06, 2007 at 02:31:51PM -0400, Lon Hohberger wrote:
> > > > I forgot what this was... could you just mail me your original email
> > > > off-list?
> > > Nevermind, I found it
> > > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> > I think it's actually the same problem as the 'status checks' being
> > wrong in 2.0.24; just a different symptom.
> > What architecture do you have?  I can build a package for you to test if
> > you want.
> 
> x86_64
> 
> Thanks, it'd be nice.

http://people.redhat.com/lhh/rhel5-test

You'll need at least the updated cman package.  The -2.1lhh build of
rgmanager is the one I just built today; the others are a bit older.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Tue Jul 10 22:22:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 10 Jul 2007 18:22:46 -0400
Subject: [Linux-cluster] rgmanager on debian
In-Reply-To: <1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru>
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com>
	<1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru>
	<1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru>
Message-ID: <20070710222246.GH18076@redhat.com>

On Tue, Jul 10, 2007 at 09:06:21PM +0400, Pavel D. Kuzin wrote:
> I`m trying to use rgmanager on debian.
> and see strange trouble.
> can you help me?
> 
> i`m trying to start httpd with rgmanager and see this in logs 
> But gfs volume mount os ok.
> I`m trying to understand source of  error=32 but i can`t....

Could you run:

clusvcadm -d hosting-httpd
rg_test test /etc/cluster/cluster.conf start service hosting-httpd
rg_test test /etc/cluster/cluster.conf stop service hosting-httpd

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Tue Jul 10 22:36:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 10 Jul 2007 18:36:30 -0400
Subject: [Linux-cluster] IP Relocate Error / IP Restart error
In-Reply-To: <46939CC4.5050305@algitech.com>
References: <468E535F.2080606@algitech.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA80E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<54098.10.11.12.13.1183990000.squirrel@10.11.12.13>
	<20070709212329.GA18076@redhat.com> <46939CC4.5050305@algitech.com>
Message-ID: <20070710223630.GI18076@redhat.com>

On Tue, Jul 10, 2007 at 04:50:44PM +0200, Dan Deshayes wrote:
> Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> Stopping service 
> service:www-project1
> Jul 10 16:22:32 asl012 clurgmgrd: [412]: <err> script:psql-db: stop of 
> /etc/init.d/postgresql failed (returned 1)
> Jul 10 16:22:32 asl012 clurgmgrd[412]: <notice> stop on script "psql-db" 
> returned 1 (generic error)

for this one:

http://sources.redhat.com/cluster/faq.html#rgm_wontrestart

> I'm using bond0 interface configured the following:
> DEVICE=bond0
> USERCTL=no
> ONBOOT=yes
> BROADCAST=<broadcast>
> NETWORK=<network>.32
> NETMASK=255.255.255.224
> IPADDR=<external ip 1>
> GATEWAY=<gw ip>

Ah - rgmanager won't start the *interface* for you.  You need an
already-running IP address on bond0 that is different from the one
rgmanager is trying to start.

So, if you have IPs on:

node1 192.168.1.2/24
node2 192.168.1.3/24
node3 192.168.1.4/24

You can have the IP address be on the 192.168.1.0/24 subnet, but not on,
say, 10.1.2.0/24.

That is, externally-facing IPs on the same subnet as the one you're
trying to manage need to be up (even if firewalled off) so rgmanager
can determine the device to use.  This is done so that different nodes
can use different devices (e.g. eth1 on 192.168.1.0/24 on nodes 1,2 but
bond0 on the same network on node 1...).

So, basically, you need a device on each node to already have an
external IP bound which is in the same subnet as the IP you're trying to
move between nodes.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Tue Jul 10 22:37:08 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 10 Jul 2007 18:37:08 -0400
Subject: [Linux-cluster] rgmanager on debian
In-Reply-To: <20070710222246.GH18076@redhat.com>
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com>
	<1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru>
	<1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru>
	<20070710222246.GH18076@redhat.com>
Message-ID: <20070710223708.GJ18076@redhat.com>

On Tue, Jul 10, 2007 at 06:22:46PM -0400, Lon Hohberger wrote:
> On Tue, Jul 10, 2007 at 09:06:21PM +0400, Pavel D. Kuzin wrote:
> > I`m trying to understand source of  error=32 but i can`t....

It's the return code of the mount command.

> Could you run:
> 
> clusvcadm -d hosting-httpd
> rg_test test /etc/cluster/cluster.conf start service hosting-httpd
> rg_test test /etc/cluster/cluster.conf stop service hosting-httpd

(and get the output...)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From stanojr at blackhole.sk  Tue Jul 10 23:46:57 2007
From: stanojr at blackhole.sk (Pavel Stano)
Date: Wed, 11 Jul 2007 01:46:57 +0200
Subject: [Linux-cluster] strange slowness of ls with 1 newly created file on
	gfs 1 or 2
Message-ID: <46941A71.7010602@blackhole.sk>

Hello,

i am testing gfs and its very slow, please look at this if it is normal
or i miss something

i have 2 node cluster, nodes are connected via SAS to disk array promise
e310s, when i run dd on attached block device i have cca 150MBps
throughput on both nodes
there is debian etch, i compile cluster-2.00.00 with gfs1 module
i create one 475GB logical volume (i dont use clvmd, just normal lvm),
create gfs1 on it
gfs_mkfs -t cluster1:data0 -p lock_dlm -j 2 /dev/vgdata0/lvdata0
mount that lv on both nodes to directory /d/0/, run df on both nodes
and then run touch on node 1:
serpico# touch /d/0/test

and ls on node 2:
dinorscio:~# time ls /d/0/
test

real    0m9.486s
user    0m0.000s
sys     0m0.004s

it took almost 10 seconds to display 1 file on that filesystem
when i again create other file via touch(node1) and run ls (node2) it
took again cca 10 seconds
i monitor activity with dstat and there is 50% iowait on node where run
ls (50% because 2 core cpu on node), but no disk activity
nodes are connected via 1gbps idle ethernet

and when ls is runing, i look at wchan with ps
ps axf -o pid,wchan:20,cmd|grep ls
6387 sync_buffer                               \_ ls --color=auto /d/0/
i run ps many times there is still sync_buffer, i dont see other kernel
function

this is my cluster.conf
<?xml version="1.0"?>
<cluster name="cluster1" config_version="20">
        <cman expected_votes="1" two_node="1" />
        <clusternodes>
                <clusternode name="dinorscio" votes="1" nodeid="1">
                        <fence>
                        </fence>
                </clusternode>
                <clusternode name="serpico" votes="1" nodeid="2">
                        <fence>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

and last thing, i try gfs2, but same result

Thank you
--
Pavel Stano



From wcheng at redhat.com  Wed Jul 11 02:23:38 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Tue, 10 Jul 2007 22:23:38 -0400
Subject: [Linux-cluster] strange slowness of ls with 1 newly created file
	on	gfs 1 or 2
In-Reply-To: <46941A71.7010602@blackhole.sk>
References: <46941A71.7010602@blackhole.sk>
Message-ID: <46943F2A.3080707@redhat.com>

Pavel Stano wrote:

>and then run touch on node 1:
>serpico# touch /d/0/test
>
>and ls on node 2:
>dinorscio:~# time ls /d/0/
>test
>
>  
>

What have you expected from a cluster filesystem ? When you touch a file 
on node 1, it is a "create" that requires at least 2 exclusive locks 
(directory lock and the file lock itself, among many other things). On a 
local filesystem such as ext3, disk activities are delayed due to 
filesystem cache where "touch" writes the data into cache and "ls" reads 
it from cache on the very same node - all memory operations.  On cluster 
filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
release the locks (few ping-pong messages between two nodes and lock 
managers via network), the contents inside node 1's cache need to get 
synced to the shared storage. After node 2 gets the locks, it  has to 
read contents from the disk.

I hope the above explanation is clear.

>and last thing, i try gfs2, but same result
>
>
>  
>
-- Wendy



From pk at nodex.ru  Wed Jul 11 06:20:08 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Wed, 11 Jul 2007 10:20:08 +0400
Subject: [Linux-cluster] rgmanager on debian
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com><1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru><1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru><20070710222246.GH18076@redhat.com>
	<20070710223708.GJ18076@redhat.com>
Message-ID: <1c2801c7c383$87ff0e00$0a01a8c0@office.nodex.ru>

node1:/usr/local/apache/bin# clusvcadm -d hosting
Member node1 disabling hosting...success
node1:/usr/local/apache/bin# rg_test test /etc/cluster/cluster.conf start service hosting
Running in test mode.
Starting hosting...
/usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
mount: /dev/mapper/Hosting-Hosting_data already mounted or /usr/hosting busy
<err>    'mount -t gfs  /dev/mapper/Hosting-Hosting_data /usr/hosting' failed, error=32
Failed to start hosting
+++ Memory table dump +++
  0xb74b01f8 (16 bytes) allocation trace:
  0xb74adb4c (16 bytes) allocation trace:
  0xb74a5bb8 (16 bytes) allocation trace:
  0xb74ace34 (16 bytes) allocation trace:
  0xb749b270 (16 bytes) allocation trace:
  0xb74ad0b8 (16 bytes) allocation trace:
  0xb74a5d64 (16 bytes) allocation trace:
  0xb74a561c (16 bytes) allocation trace:
  0xb74b382c (776 bytes) allocation trace:
--- End Memory table dump ---
node1:/usr/local/apache/bin# df
Filesystem           1K-blocks      Used Available Use% Mounted on
10.210.10.200:/usr/hosting/nfs/ROOT
                      43759684  19140376  24619308  44% /
tmpfs                  1038060         0   1038060   0% /lib/init/rw
varrun                 1038060        64   1037996   1% /var/run
varlock                1038060         0   1038060   0% /var/lock
udev                     10240        48     10192   1% /dev
tmpfs                  1038060         0   1038060   0% /dev/shm


--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru

----- Original Message ----- 
From: "Lon Hohberger" <lhh at redhat.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Wednesday, July 11, 2007 2:37 AM
Subject: Re: [Linux-cluster] rgmanager on debian


> On Tue, Jul 10, 2007 at 06:22:46PM -0400, Lon Hohberger wrote:
>> On Tue, Jul 10, 2007 at 09:06:21PM +0400, Pavel D. Kuzin wrote:
>> > I`m trying to understand source of  error=32 but i can`t....
> 
> It's the return code of the mount command.
> 
>> Could you run:
>> 
>> clusvcadm -d hosting-httpd
>> rg_test test /etc/cluster/cluster.conf start service hosting-httpd
>> rg_test test /etc/cluster/cluster.conf stop service hosting-httpd
> 
> (and get the output...)
> 
> -- 
> Lon Hohberger - Software Engineer - Red Hat, Inc.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From pk at nodex.ru  Wed Jul 11 06:28:52 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Wed, 11 Jul 2007 10:28:52 +0400
Subject: [Linux-cluster] rgmanager on debian
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com><1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru><1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru><20070710222246.GH18076@redhat.com>
	<20070710223708.GJ18076@redhat.com>
Message-ID: <1c3101c7c384$bfbcf680$0a01a8c0@office.nodex.ru>

node2:/usr/share/cluster# clusvcadm -d hosting
Member node2 disabling hosting...success
node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf start service hosting
Running in test mode.
Starting hosting...
/usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
/usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
<info>   Executing /etc/init.d/httpd start
<info>   Executing /etc/init.d/proftpd start
Starting ftp server: proftpd.
Start of hosting complete
+++ Memory table dump +++
  0xb74d0694 (16 bytes) allocation trace:
  0xb74d7b4c (16 bytes) allocation trace:
  0xb74cfbb8 (16 bytes) allocation trace:
  0xb74d6e34 (16 bytes) allocation trace:
  0xb74c5270 (16 bytes) allocation trace:
  0xb74d70b8 (16 bytes) allocation trace:
  0xb74cfd64 (16 bytes) allocation trace:
  0xb74cf61c (16 bytes) allocation trace:
  0xb74dd82c (776 bytes) allocation trace:
--- End Memory table dump ---
node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf stop service hosting
Running in test mode.
Stopping hosting...
<info>   Executing /etc/init.d/httpd stop
<info>   Executing /etc/init.d/proftpd stop
Stopping ftp server: proftpd.
<info>   unmounting /dev/mapper/Hosting-Hosting_data (/usr/hosting)
<info>   unmounting /dev/mapper/Hosting-TMP_Data (/tmp)
Stop of hosting complete
+++ Memory table dump +++
  0xb74e0694 (16 bytes) allocation trace:
  0xb74e7b4c (16 bytes) allocation trace:
  0xb74dfbb8 (16 bytes) allocation trace:
  0xb74e6e34 (16 bytes) allocation trace:
  0xb74d5270 (16 bytes) allocation trace:
  0xb74e70b8 (16 bytes) allocation trace:
  0xb74dfd64 (16 bytes) allocation trace:
  0xb74df61c (16 bytes) allocation trace:
  0xb74ed82c (776 bytes) allocation trace:
--- End Memory table dump ---
node2:/usr/share/cluster# clusvcadm -e hosting
Member node2 trying to enable hosting...failed

--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru

----- Original Message ----- 
From: "Lon Hohberger" <lhh at redhat.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Wednesday, July 11, 2007 2:37 AM
Subject: Re: [Linux-cluster] rgmanager on debian


> On Tue, Jul 10, 2007 at 06:22:46PM -0400, Lon Hohberger wrote:
>> On Tue, Jul 10, 2007 at 09:06:21PM +0400, Pavel D. Kuzin wrote:
>> > I`m trying to understand source of  error=32 but i can`t....
> 
> It's the return code of the mount command.
> 
>> Could you run:
>> 
>> clusvcadm -d hosting-httpd
>> rg_test test /etc/cluster/cluster.conf start service hosting-httpd
>> rg_test test /etc/cluster/cluster.conf stop service hosting-httpd
> 
> (and get the output...)
> 
> -- 
> Lon Hohberger - Software Engineer - Red Hat, Inc.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From pk at nodex.ru  Wed Jul 11 06:41:41 2007
From: pk at nodex.ru (Pavel D. Kuzin)
Date: Wed, 11 Jul 2007 10:41:41 +0400
Subject: [Linux-cluster] rgmanager on debian
References: <E67F1468BF7A4C418D874810215A377EAF9DCD@EITO-MBX01.internal.vodafone.com><1ad901c7c2fc$f8712420$0a01a8c0@office.nodex.ru><1bd301c7c314$a3aec290$0a01a8c0@office.nodex.ru><20070710222246.GH18076@redhat.com><20070710223708.GJ18076@redhat.com>
	<1c3101c7c384$bfbcf680$0a01a8c0@office.nodex.ru>
Message-ID: <1c3801c7c386$8a713f20$0a01a8c0@office.nodex.ru>

This problem was fixed.
Seems it was because in fstab was record /dev/Hosting/Hosting_data and rgmanager tryes to mount /dev/mapper/Hosting-Hosting_data.

Now i have another question.
How i can start one service on both nodes simultaneously ?

--
Pavel D.Kuzin
System Administrator
Nodex  ISP
St. Petersburg, Russia
pk at nodex.ru
http://nodex.ru

----- Original Message ----- 
From: "Pavel D. Kuzin" <pk at nodex.ru>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Wednesday, July 11, 2007 10:28 AM
Subject: Re: [Linux-cluster] rgmanager on debian


> node2:/usr/share/cluster# clusvcadm -d hosting
> Member node2 disabling hosting...success
> node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf start service hosting
> Running in test mode.
> Starting hosting...
> /usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
> /usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
> <info>   Executing /etc/init.d/httpd start
> <info>   Executing /etc/init.d/proftpd start
> Starting ftp server: proftpd.
> Start of hosting complete
> +++ Memory table dump +++
>  0xb74d0694 (16 bytes) allocation trace:
>  0xb74d7b4c (16 bytes) allocation trace:
>  0xb74cfbb8 (16 bytes) allocation trace:
>  0xb74d6e34 (16 bytes) allocation trace:
>  0xb74c5270 (16 bytes) allocation trace:
>  0xb74d70b8 (16 bytes) allocation trace:
>  0xb74cfd64 (16 bytes) allocation trace:
>  0xb74cf61c (16 bytes) allocation trace:
>  0xb74dd82c (776 bytes) allocation trace:
> --- End Memory table dump ---
> node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf stop service hosting
> Running in test mode.
> Stopping hosting...
> <info>   Executing /etc/init.d/httpd stop
> <info>   Executing /etc/init.d/proftpd stop
> Stopping ftp server: proftpd.
> <info>   unmounting /dev/mapper/Hosting-Hosting_data (/usr/hosting)
> <info>   unmounting /dev/mapper/Hosting-TMP_Data (/tmp)
> Stop of hosting complete
> +++ Memory table dump +++
>  0xb74e0694 (16 bytes) allocation trace:
>  0xb74e7b4c (16 bytes) allocation trace:
>  0xb74dfbb8 (16 bytes) allocation trace:
>  0xb74e6e34 (16 bytes) allocation trace:
>  0xb74d5270 (16 bytes) allocation trace:
>  0xb74e70b8 (16 bytes) allocation trace:
>  0xb74dfd64 (16 bytes) allocation trace:
>  0xb74df61c (16 bytes) allocation trace:
>  0xb74ed82c (776 bytes) allocation trace:
> --- End Memory table dump ---
> node2:/usr/share/cluster# clusvcadm -e hosting
> Member node2 trying to enable hosting...failed
> 
> --
> Pavel D.Kuzin
> System Administrator
> Nodex  ISP
> St. Petersburg, Russia
> pk at nodex.ru
> http://nodex.ru
> 
> ----- Original Message ----- 
> From: "Lon Hohberger" <lhh at redhat.com>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Wednesday, July 11, 2007 2:37 AM
> Subject: Re: [Linux-cluster] rgmanager on debian
> 
> 
>> On Tue, Jul 10, 2007 at 06:22:46PM -0400, Lon Hohberger wrote:
>>> On Tue, Jul 10, 2007 at 09:06:21PM +0400, Pavel D. Kuzin wrote:
>>> > I`m trying to understand source of  error=32 but i can`t....
>> 
>> It's the return code of the mount command.
>> 
>>> Could you run:
>>> 
>>> clusvcadm -d hosting-httpd
>>> rg_test test /etc/cluster/cluster.conf start service hosting-httpd
>>> rg_test test /etc/cluster/cluster.conf stop service hosting-httpd
>> 
>> (and get the output...)
>> 
>> -- 
>> Lon Hohberger - Software Engineer - Red Hat, Inc.
>> 
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From bsd_daemon at msn.com  Wed Jul 11 08:44:56 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 11 Jul 2007 08:44:56 +0000
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
Message-ID: <BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>


hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now. may 
in future..

the following paragraph is from
http://www.redhat.com/support/resources/howto/piranha/

"Currently, the LVS cluster supports one routing method,
Network Address Translation (NAT). (In the future, tunneling
and direct routing will be added.)"

secondly, your configuration is wrong.

LVS eth0 -> 10.1.40.45
      eth0:0 -> 10.1.40.40
LVS eth1 -> 11.1.40.45
      eth1:0 -> 11.1.40.40

RS eth0 -> 10.1.40.34

you should write "10.1.40.40 and eth0:0" to "NAT Router IP and Device" on 
piranha. is not eth1:1. the default gw should be set "10.1.40.40" for RS. 
Then, "Virtual Service" IP should be set "11.1.40.40 and eth1:0".

REQUEST  =   client -> (11.1.40.40)  ipvs  forward on lvs   (10.1.40.40)  -> 
  (10.1.40.34) RS
RESPOND  =   client <- (11.1.40.40)  NAT on lvs-gateway  (10.1.40.40)  <-  
(10.1.40.34) RS

if you change configuration, you must restart pulse service.

have a nice day.


Mehmet CELIK
Istanbul/TURKEY

>From: "Panigrahi, Santosh Kumar" <Santosh.Panigrahi at in.unisys.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: <linux-cluster at redhat.com>
>Subject: [Linux-cluster] doubts in Piranha
>Date: Tue, 10 Jul 2007 17:33:37 +0530
>
>Hi,
>
>
>
>I want to know, whether piranha supports configuring the following LVS
>cluster.
>
>
>
>1) Direct routing
>
>2) Nat routing
>
>3) Tunneling
>
>
>
>Actually, I am configuring a NAT LVS cluster in piranha with one LVS
>router and one Real Server. (For LVS testing purpose)
>
>I am doing it in RHEL5 with piranha-0.8.4-7.el5.
>
>
>
>LVS Router: 2 NIC cards (eth0: 10.1.40.45 - private interface, eth1:
>11.1.40.45- public interface, eth1:0: 11.1.40.40- Virtual Public IP)
>
>Real server:  1 NIC card (eth0: 10.1.40.34)
>
>
>
>My piranha configuration is as follows.
>
>
>
>Virtual server Tab:
>
>
>
>
>
>Global Settings Tab:
>
>
>
>
>
>I am configuring it to open a telnet session from a client PC, which is
>not happening at all. My guess is piranha is not updating the IPVSADM
>table properly as per the configuration or I have configured it wrongly.
>Can somebody help me in this regard?
>
>
>
>Thanks and Regards,
>
>Santosh
>
><< image001.jpg >>
><< image004.jpg >>


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://newlivehotmail.com



From hal_bg at yahoo.com  Wed Jul 11 10:34:24 2007
From: hal_bg at yahoo.com (Hal)
Date: Wed, 11 Jul 2007 03:34:24 -0700 (PDT)
Subject: [Linux-cluster] Redhat cluster and GFS node limit
Message-ID: <430395.53669.qm@web32210.mail.mud.yahoo.com>

Hi all,

I have a project which requires a cluster of over 100 GFS nodes. GFS supports
them, but RH cluster supports only 16 nodes. For this case the documentation
says:

"Red Hat GFS is scalable up to 300 nodes, a Red Hat Cluster Manager limits the
total number of nodes in a cluster to 16. Therefore, in this scenario, Red Hat
GFS scalability is limited. If the 16-node limit is too small for your
deployment, you may want to consider using multiple Red Hat Cluster Manager
clusters."

So any docs or ideas on how to build "multiple Red Hat Cluster Manager
clusters"? 
Will this solution enable all 100+ nodes to access the same File system? 


Thank you!
Hal


 
____________________________________________________________________________________
Looking for earth-friendly autos? 
Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center.
http://autos.yahoo.com/green_center/



From Santosh.Panigrahi at in.unisys.com  Wed Jul 11 11:41:51 2007
From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar)
Date: Wed, 11 Jul 2007 17:11:51 +0530
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
Message-ID: <D566E8CF3538B54D95B925CB69CB4D2A0526DDEE@inblr-exch1.eu.uis.unisys.com>

Hi Mehmet CELIK,

 

Thanks for your reply.

 

I have changed my piranha configuration as per your instruction. Still
it is not working. Configuration is follows.

 

LVS eth0 -> 10.1.40.45

      eth0:0 -> 10.1.40.50

LVS eth1 -> 11.1.40.45

      eth1:0 -> 11.1.40.50

 

RS eth0 -> 10.1.40.34

 

My lvs.cf is as follows:

serial_no = 12

primary = 11.1.40.45

service = lvs

network = nat

nat_router = 10.1.40.50 eth0:0

nat_nmask = 255.255.255.0

debug_level = NONE

virtual Telnet {

     active = 1

     address = 11.1.40.50 eth1:0

     vip_nmask = 255.255.255.0

     port = 23

     send = "GET / HTTP/1.0\r\n\r\n"

     expect = "HTTP"

     use_regex = 0

     load_monitor = none

     scheduler = wlc

     protocol = tcp

     timeout = 6

     reentry = 15

     quiesce_server = 0

     server m1034 {

         address = 10.1.40.34

         active = 1

         weight = 100

     }

}

 

After this I am starting pulse service in my LVS.

Following output I am getting in my LVS on passing ipvsadm -L -n comman.


 

[root at m1045 ha]# ipvsadm -L -n

IP Virtual Server version 1.2.1 (size=4096)

Prot LocalAddress:Port Scheduler Flags

  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn

TCP  11.1.40.50:23 wlc

[root at m1045 ha]#

 

Note: Here there is no route for forward path to Real server. I am
guesing piranha is not updating it. 

 

Log from CONTROL/MONITORING Tab of Piranha:

 

 

 

What should I do after this?

 

Thanks,

Santosh

 

 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of mehmet celik
Sent: Wednesday, July 11, 2007 2:15 PM
To: linux-cluster at redhat.com
Subject: RE: [Linux-cluster] doubts in Piranha

 

 

hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now.
may 

in future..

 

the following paragraph is from

http://www.redhat.com/support/resources/howto/piranha/

 

"Currently, the LVS cluster supports one routing method,

Network Address Translation (NAT). (In the future, tunneling

and direct routing will be added.)"

 

secondly, your configuration is wrong.

 

LVS eth0 -> 10.1.40.45

      eth0:0 -> 10.1.40.40

LVS eth1 -> 11.1.40.45

      eth1:0 -> 11.1.40.40

 

RS eth0 -> 10.1.40.34

 

you should write "10.1.40.40 and eth0:0" to "NAT Router IP and Device"
on 

piranha. is not eth1:1. the default gw should be set "10.1.40.40" for
RS. 

Then, "Virtual Service" IP should be set "11.1.40.40 and eth1:0".

 

REQUEST  =   client -> (11.1.40.40)  ipvs  forward on lvs   (10.1.40.40)
-> 

  (10.1.40.34) RS

RESPOND  =   client <- (11.1.40.40)  NAT on lvs-gateway  (10.1.40.40)
<-  

(10.1.40.34) RS

 

if you change configuration, you must restart pulse service.

 

have a nice day.

 

 

Mehmet CELIK

Istanbul/TURKEY

 

>From: "Panigrahi, Santosh Kumar" <Santosh.Panigrahi at in.unisys.com>

>Reply-To: linux clustering <linux-cluster at redhat.com>

>To: <linux-cluster at redhat.com>

>Subject: [Linux-cluster] doubts in Piranha

>Date: Tue, 10 Jul 2007 17:33:37 +0530

> 

>Hi,

> 

> 

> 

>I want to know, whether piranha supports configuring the following LVS

>cluster.

> 

> 

> 

>1) Direct routing

> 

>2) Nat routing

> 

>3) Tunneling

> 

> 

> 

>Actually, I am configuring a NAT LVS cluster in piranha with one LVS

>router and one Real Server. (For LVS testing purpose)

> 

>I am doing it in RHEL5 with piranha-0.8.4-7.el5.

> 

> 

> 

>LVS Router: 2 NIC cards (eth0: 10.1.40.45 - private interface, eth1:

>11.1.40.45- public interface, eth1:0: 11.1.40.40- Virtual Public IP)

> 

>Real server:  1 NIC card (eth0: 10.1.40.34)

> 

> 

> 

>My piranha configuration is as follows.

> 

> 

> 

>Virtual server Tab:

> 

> 

> 

> 

> 

>Global Settings Tab:

> 

> 

> 

> 

> 

>I am configuring it to open a telnet session from a client PC, which is

>not happening at all. My guess is piranha is not updating the IPVSADM

>table properly as per the configuration or I have configured it
wrongly.

>Can somebody help me in this regard?

> 

> 

> 

>Thanks and Regards,

> 

>Santosh

> 

><< image001.jpg >>

><< image004.jpg >>

 

 

>--

>Linux-cluster mailing list

>Linux-cluster at redhat.com

>https://www.redhat.com/mailman/listinfo/linux-cluster

 

_________________________________________________________________

http://newlivehotmail.com

 

--

Linux-cluster mailing list

Linux-cluster at redhat.com

https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070711/7ef2051d/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 141773 bytes
Desc: image001.jpg
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070711/7ef2051d/attachment.jpg>

From Christian.Schaefer2 at messe-muenchen.de  Wed Jul 11 12:31:36 2007
From: Christian.Schaefer2 at messe-muenchen.de (Schaefer Christian)
Date: Wed, 11 Jul 2007 14:31:36 +0200
Subject: AW: [Linux-cluster] Redhat cluster and GFS node limit
In-Reply-To: <430395.53669.qm@web32210.mail.mud.yahoo.com>
References: <430395.53669.qm@web32210.mail.mud.yahoo.com>
Message-ID: <56825E6A1E167E41A80CE0F6F6916CAB0266A7A3@MNTSVCL10E.mmgmuc.de>

Dear Hal,

this limitation means, that you can only run up to 16 rgmanagers in one cluster infrastructure. rgmanager thereby is only needed if you build HA services like a active/passive MySQL cluster services. If you like to build a 100 node webserver farm with external loadbalancing/failover - such as LVS and/or keepalived - no rgmanager is needed to be run on that cluster. 

In this case you can set up a 100 node cluster with shared gfs devices as you like. I too recommend you to have a look at http://open-sharedroot.org/documentation. The concept demonstarted on that page is do build a n node cluster from one filesystem single image which keeps administration simple.

Greetings,
Christian

Best Regards, 

Christian Sch?fer
Abt. IT-Applications
Messe M?nchen 

Tel.: +49 (0) 89 949 21985 
E-Mail: christian.schaefer2 at messe-muenchen.de 
WWW: www.messe-muenchen.de 

 

> -----Urspr?ngliche Nachricht-----
> Von: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Hal
> Gesendet: Mittwoch, 11. Juli 2007 12:34
> An: linux-cluster at redhat.com
> Betreff: [Linux-cluster] Redhat cluster and GFS node limit
> 
> Hi all,
> 
> I have a project which requires a cluster of over 100 GFS 
> nodes. GFS supports
> them, but RH cluster supports only 16 nodes. For this case 
> the documentation
> says:
> 
> "Red Hat GFS is scalable up to 300 nodes, a Red Hat Cluster 
> Manager limits the
> total number of nodes in a cluster to 16. Therefore, in this 
> scenario, Red Hat
> GFS scalability is limited. If the 16-node limit is too small for your
> deployment, you may want to consider using multiple Red Hat 
> Cluster Manager
> clusters."
> 
> So any docs or ideas on how to build "multiple Red Hat Cluster Manager
> clusters"? 
> Will this solution enable all 100+ nodes to access the same 
> File system? 
> 
> 
> Thank you!
> Hal
> 
> 
>  
> ______________________________________________________________
> ______________________
> Looking for earth-friendly autos? 
> Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center.
> http://autos.yahoo.com/green_center/
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From tomas.hoger at gmail.com  Wed Jul 11 12:10:14 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Wed, 11 Jul 2007 14:10:14 +0200
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
Message-ID: <6cfbd1b40707110510x69096d47ya03db1834b9217c1@mail.gmail.com>

Hi!

On 7/11/07, mehmet celik <bsd_daemon at msn.com> wrote:
> hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now. may
> in future..
>
> the following paragraph is from
> http://www.redhat.com/support/resources/howto/piranha/
>
> "Currently, the LVS cluster supports one routing method,
> Network Address Translation (NAT). (In the future, tunneling
> and direct routing will be added.)"

I'm not sure why referenced document says that piranha does not
support DR, maybe it's out of date or maybe quoted part was supposed
to have different meaning in proper context.  It is possible to use DR
and official Red Hat documentation provides some instructions on how
to do it.  See:

RHEL5:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Virtual_Server_Administration/index.html

RHEL4:
http://www.redhat.com/docs/manuals/csgfs/browse/4.5/SAC_Virtual_Server_Administration/index.html


> secondly, your configuration is wrong.
>
> LVS eth0 -> 10.1.40.45
>       eth0:0 -> 10.1.40.40
> LVS eth1 -> 11.1.40.45
>       eth1:0 -> 11.1.40.40
>
> RS eth0 -> 10.1.40.34

First of all, you should use *different* networks on each interface of
load balancer with LVS-NAT.  I think there are some instructions in
LVS howto about single network configurations, but two networks are
usually recommended.  And of course your load balancer must be set as
default GW on real servers.

Try checking Red Hat documents referenced above or check howtos on LVS
site (e.g. mini-HOWTO) for information about typical topologies.

th.



From bsd_daemon at msn.com  Wed Jul 11 13:39:00 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 11 Jul 2007 13:39:00 +0000
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A0526DDEE@inblr-exch1.eu.uis.unisys.com>
Message-ID: <BAY105-F948CC3DFC60C3E7AE2235E3040@phx.gbl>


hi, did you make "ip_forward" active on lvs ? lvs must be default gw for 
RSs. And, did you check RSs connection on LVS, because on lvs, RSs can not 
seen.

>root at m1045 ha]# ipvsadm -L -n
>
>IP Virtual Server version 1.2.1 (size=4096)
>Prot LocalAddress:Port Scheduler Flags
>
>  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
>TCP  11.1.40.50:23 wlc
>

my lvs configuration is below (LVS/NAT)

Main Network 10.0.0.0/24
Main Router 10.0.0.254
LVS eth0 -> 10.0.0.100
      eth0:0 -> 10.0.0.200    (Virtual IP for Virtual Service)
      eth1 -> 172.16.1.1/24  (the gateway for RS1-RS6)

RS1 eth0 -> 172.16.1.101
RS2 eth0 -> 172.16.1.102
RS3 eth0 -> 172.16.1.103
RS4 eth0 -> 172.16.1.104
RS5 eth0 -> 172.16.1.105
RS6 eth0 -> 172.16.1.106

ipvsadm -L -n

TCP 10.0.0.201:80  rr
  -> 172.16.1.101:80    Masq  1   0   X
  -> 172.16.1.102:80    Masq  1   0   X
  -> 172.16.1.103:80    Masq  1   0   X
  -> 172.16.1.104:80    Masq  1   0   X
  -> 172.16.1.105:80    Masq  1   0   X
  -> 172.16.1.106:80    Masq  1   0   X

have a nice day..

>From: "Panigrahi, Santosh Kumar" <Santosh.Panigrahi at in.unisys.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: RE: [Linux-cluster] doubts in Piranha
>Date: Wed, 11 Jul 2007 17:11:51 +0530
>
>Hi Mehmet CELIK,
>
>
>
>Thanks for your reply.
>
>
>
>I have changed my piranha configuration as per your instruction. Still
>it is not working. Configuration is follows.
>
>
>
>LVS eth0 -> 10.1.40.45
>
>       eth0:0 -> 10.1.40.50
>
>LVS eth1 -> 11.1.40.45
>
>       eth1:0 -> 11.1.40.50
>
>
>
>RS eth0 -> 10.1.40.34
>
>
>
>My lvs.cf is as follows:
>
>serial_no = 12
>
>primary = 11.1.40.45
>
>service = lvs
>
>network = nat
>
>nat_router = 10.1.40.50 eth0:0
>
>nat_nmask = 255.255.255.0
>
>debug_level = NONE
>
>virtual Telnet {
>
>      active = 1
>
>      address = 11.1.40.50 eth1:0
>
>      vip_nmask = 255.255.255.0
>
>      port = 23
>
>      send = "GET / HTTP/1.0\r\n\r\n"
>
>      expect = "HTTP"
>
>      use_regex = 0
>
>      load_monitor = none
>
>      scheduler = wlc
>
>      protocol = tcp
>
>      timeout = 6
>
>      reentry = 15
>
>      quiesce_server = 0
>
>      server m1034 {
>
>          address = 10.1.40.34
>
>          active = 1
>
>          weight = 100
>
>      }
>
>}
>
>
>
>After this I am starting pulse service in my LVS.
>
>Following output I am getting inmy LVS on passing  ipvsadm -L -n comman.
>
>
>
>
>[root at m1045 ha]# ipvsadm -L -n
>
>IP Virtual Server version 1.2.1 (size=4096)
>
>Prot LocalAddress:Port Scheduler Flags
>
>   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
>
>TCP  11.1.40.50:23 wlc
>
>[root at m1045 ha]#
>
>
>
>Note: Here there is no route for forward path to Real server. I am
>guesing piranha is not updating it.
>
>
>
>Log from CONTROL/MONITORING Tab of Piranha:
>
>
>
>
>
>
>
>What should I do after this?
>
>
>
>Thanks,
>
>Santosh
>
>
>
>
>
>-----Original Message-----
>From: linux-cluster-bounces at redhat.com
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of mehmet celik
>Sent: Wednesday, July 11, 2007 2:15 PM
>To: linux-cluster at redhat.com
>Subject: RE: [Linux-cluster] doubts in Piranha
>
>
>
>
>
>hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now.
>may
>
>in future..
>
>
>
>the following paragraph is from
>
>http://www.redhat.com/support/resources/howto/piranha/
>
>
>
>"Currently, the LVS cluster supports one routing method,
>
>Network Address Translation (NAT). (In the future, tunneling
>
>and direct routing will be added.)"
>
>
>
>secondly, your configuration is wrong.
>
>
>
>LVS eth0 -> 10.1.40.45
>
>       eth0:0 -> 10.1.40.40
>
>LVS eth1 -> 11.1.40.45
>
>       eth1:0 -> 11.1.40.40
>
>
>
>RS eth0 -> 10.1.40.34
>
>
>
>you should write "10.1.40.40 and eth0:0" to "NAT Router IP and Device"
>on
>
>piranha. is not eth1:1. the default gw should be set "10.1.40.40" for
>RS.
>
>Then, "Virtual Service" IP should be set "11.1.40.40 and eth1:0".
>
>
>
>REQUEST  =   client -> (11.1.40.40)  ipvs  forward on lvs   (10.1.40.40)
>->
>
>   (10.1.40.34) RS
>
>RESPOND  =   client <- (11.1.40.40)  NAT on lvs-gateway  (10.1.40.40)
><-
>
>(10.1.40.34) RS
>
>
>
>if you change configuration, you must restart pulse service.
>
>
>
>have a nice day.
>
>
>
>
>
>Mehmet CELIK
>
>Istanbul/TURKEY
>
>
>
> >From: "Panigrahi, Santosh Kumar" <Santosh.Panigrahi at in.unisys.com>
>
> >Reply-To: linux clustering <linux-cluster at redhat.com>
>
> >To: <linux-cluster at redhat.com>
>
> >Subject: [Linux-cluster] doubts in Piranha
>
> >Date: Tue, 10 Jul 2007 17:33:37 +0530
>
> >
>
> >Hi,
>
> >
>
> >
>
> >
>
> >I want to know, whether piranha supports configuring the following LVS
>
> >cluster.
>
> >
>
> >
>
> >
>
> >1) Direct routing
>
> >
>
> >2) Nat routing
>
> >
>
> >3) Tunneling
>
> >
>
> >
>
> >
>
> >Actually, I am configuring a NAT LVS cluster in piranha with one LVS
>
> >router and one Real Server. (For LVS testing purpose)
>
> >
>
> >I am doing it in RHEL5 with piranha-0.8.4-7.el5.
>
> >
>
> >
>
> >
>
> >LVS Router: 2 NIC cards (eth0: 10.1.40.45 - private interface, eth1:
>
> >11.1.40.45- public interface, eth1:0: 11.1.40.40- Virtual Public IP)
>
> >
>
> >Real server:  1 NIC card (eth0: 10.1.40.34)
>
> >
>
> >
>
> >
>
> >My piranha configuration is as follows.
>
> >
>
> >
>
> >
>
> >Virtual server Tab:
>
> >
>
> >
>
> >
>
> >
>
> >
>
> >Global Settings Tab:
>
> >
>
> >
>
> >
>
> >
>
> >
>
> >I am configuring it to open a telnet session from a client PC, which is
>
> >not happening at all. My guess is piranha is not updating the IPVSADM
>
> >table properly as per the configuration or I have configured it
>wrongly.
>
> >Can somebody help me in this regard?
>
> >
>
> >
>
> >
>
> >Thanks and Regards,
>
> >
>
> >Santosh
>
> >
>
> ><< image001.jpg >>
>
> ><< image004.jpg >>
>
>
>
>
>
> >--
>
> >Linux-cluster mailing list
>
> >Linux-cluster at redhat.com
>
> >https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>_________________________________________________________________
>
>http://newlivehotmail.com
>
>
>
>--
>
>Linux-cluster mailing list
>
>Linux-cluster at redhat.com
>
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
><< image001.jpg >>


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://newlivehotmail.com



From lhh at redhat.com  Wed Jul 11 14:04:13 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:04:13 -0400
Subject: [Linux-cluster] issues starting services
In-Reply-To: <20070706165621.GM1681@redhat.com>
References: <46891943.6010202@cmiware.com> <20070706165621.GM1681@redhat.com>
Message-ID: <20070711140413.GK18076@redhat.com>

On Fri, Jul 06, 2007 at 12:56:21PM -0400, Lon Hohberger wrote:
> On Mon, Jul 02, 2007 at 10:26:59AM -0500, Chris Harms wrote:
> > Had a nice little hardware failure over the weekend.  After having the 
> > machine come back on-line, 2 of the registered services didn't start 
> > (the fact that they weren't running already is a function of services 
> > not failing over until fencing succeeds).
> > 
> > Issuing a start operation in Conga did nothing. 
> > 
> > Issuing  clusvcadm -e [service] -m [node] yielded:
> >    Member [node] trying to enable service:[service]...Success
> >    service:[service] is now running on [node]
> > This was not the case.  Nothing happened.  Nothing was logged.

Oh - also, run:

rg_test test [your_cluster_conf] &> rg_test.out

If you have resource conflicts (e.g. the same IP referenced by two
services), the second service will have those resources ignored.

If you see something like "Primary/unique not unique...", you've got one
:)

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 11 14:07:55 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:07:55 -0400
Subject: [Linux-cluster] rgmanager on debian
In-Reply-To: <1c2801c7c383$87ff0e00$0a01a8c0@office.nodex.ru>
References: <20070710223708.GJ18076@redhat.com>
	<1c2801c7c383$87ff0e00$0a01a8c0@office.nodex.ru>
Message-ID: <20070711140755.GL18076@redhat.com>

On Wed, Jul 11, 2007 at 10:20:08AM +0400, Pavel D. Kuzin wrote:
> node1:/usr/local/apache/bin# clusvcadm -d hosting
> Member node1 disabling hosting...success
> node1:/usr/local/apache/bin# rg_test test /etc/cluster/cluster.conf start 
> service hosting
> Running in test mode.

> Starting hosting...
> /usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
> mount: /dev/mapper/Hosting-Hosting_data already mounted or /usr/hosting busy
> <err>    'mount -t gfs  /dev/mapper/Hosting-Hosting_data /usr/hosting' 
> failed, error=32
> Failed to start hosting

Ok, so it actually *failed* to mount.  Now, if it was already mounted,
that's a bug - clusterfs.sh shouldn't try to mount the device a second
time.  According to your output below, it *was not* mounted.

Is there anything in dmesg or /var/log/messages which you think might
help?


> +++ Memory table dump +++
>  0xb74b01f8 (16 bytes) allocation trace:
>  0xb74adb4c (16 bytes) allocation trace:
>  0xb74a5bb8 (16 bytes) allocation trace:
>  0xb74ace34 (16 bytes) allocation trace:
>  0xb749b270 (16 bytes) allocation trace:
>  0xb74ad0b8 (16 bytes) allocation trace:
>  0xb74a5d64 (16 bytes) allocation trace:
>  0xb74a561c (16 bytes) allocation trace:
>  0xb74b382c (776 bytes) allocation trace:
> --- End Memory table dump ---

Hrm, that's not good, but ultimately not the cause of the problem.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From bsd_daemon at msn.com  Wed Jul 11 13:58:10 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 11 Jul 2007 13:58:10 +0000
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <6cfbd1b40707110510x69096d47ya03db1834b9217c1@mail.gmail.com>
Message-ID: <BAY105-F292A0C10AF7494A8C699EEE3040@phx.gbl>


hi Tomas, thanx for information. i didn't know it's change.

right now, i am using heartbeat+mon+iprouting+iptables for LVS/DR. this 
complex is working the perfect.

Mehmet CELIK
Istanbul/TURKEY

>From: "Tomas Hoger" <tomas.hoger at gmail.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] doubts in Piranha
>Date: Wed, 11 Jul 2007 14:10:14 +0200
>
>Hi!
>
>On 7/11/07, mehmet celik <bsd_daemon at msn.com> wrote:
>>hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now. 
>>may
>>in future..
>>
>>the following paragraph is from
>>http://www.redhat.com/support/resources/howto/piranha/
>>
>>"Currently, the LVS cluster supports one routing method,
>>Network Address Translation (NAT). (In the future, tunneling
>>and direct routing will be added.)"
>
>I'm not sure why referenced document says that piranha does not
>support DR, maybe it's out of date or maybe quoted part was supposed
>to have different meaning in proper context.  It is possible to use DR
>and official Red Hat documentation provides some instructions on how
>to do it.  See:
>
>RHEL5:
>http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Virtual_Server_Administration/index.html
>
>RHEL4:
>http://www.redhat.com/docs/manuals/csgfs/browse/4.5/SAC_Virtual_Server_Administration/index.html
>
>
>>secondly, your configuration is wrong.
>>
>>LVS eth0 -> 10.1.40.45
>>       eth0:0 -> 10.1.40.40
>>LVS eth1 -> 11.1.40.45
>>       eth1:0 -> 11.1.40.40
>>
>>RS eth0 -> 10.1.40.34
>
>First of all, you should use *different* networks on each interface of
>load balancer with LVS-NAT.  I think there are some instructions in
>LVS howto about single network configurations, but two networks are
>usually recommended.  And of course your load balancer must be set as
>default GW on real servers.
>
>Try checking Red Hat documents referenced above or check howtos on LVS
>site (e.g. mini-HOWTO) for information about typical topologies.
>
>th.
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Need a brain boost? Recharge with a stimulating game. Play now!? 
http://club.live.com/home.aspx?icid=club_hotmailtextlink1



From lhh at redhat.com  Wed Jul 11 14:14:27 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:14:27 -0400
Subject: [Linux-cluster] rgmanager on debian
In-Reply-To: <1c3101c7c384$bfbcf680$0a01a8c0@office.nodex.ru>
References: <20070710223708.GJ18076@redhat.com>
	<1c3101c7c384$bfbcf680$0a01a8c0@office.nodex.ru>
Message-ID: <20070711141427.GM18076@redhat.com>

On Wed, Jul 11, 2007 at 10:28:52AM +0400, Pavel D. Kuzin wrote:
> node2:/usr/share/cluster# clusvcadm -d hosting
> Member node2 disabling hosting...success
> node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf start 
> service hosting
> Running in test mode.
> Starting hosting...
> /usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found
> /usr/share/cluster/clusterfs.sh: line 729: logAndPrint: command not found

bug, pretty minor, but not a problem.  Patch attached; or replacement
clusterfs.sh here:

http://people.redhat.com/lhh/clusterfs.sh

(copy over old one; if you need to "back up" the old one, move it *out*
of /usr/share/cluster)

> <info>   Executing /etc/init.d/httpd start
> <info>   Executing /etc/init.d/proftpd start
> Starting ftp server: proftpd.
> Start of hosting complete
> node2:/usr/share/cluster# rg_test test /etc/cluster/cluster.conf stop 
> service hosting
> Running in test mode.
> Stopping hosting...
> <info>   Executing /etc/init.d/httpd stop
> <info>   Executing /etc/init.d/proftpd stop
> Stopping ftp server: proftpd.
> <info>   unmounting /dev/mapper/Hosting-Hosting_data (/usr/hosting)
> <info>   unmounting /dev/mapper/Hosting-TMP_Data (/tmp)
> Stop of hosting complete
> node2:/usr/share/cluster# clusvcadm -e hosting
> Member node2 trying to enable hosting...failed

So on node2, it works from within rg_test but not from clusvcadm.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.
-------------- next part --------------
? .nfsclient.sh.swp
Index: clusterfs.sh
===================================================================
RCS file: /cvs/cluster/cluster/rgmanager/src/resources/clusterfs.sh,v
retrieving revision 1.1.2.13
diff -u -r1.1.2.13 clusterfs.sh
--- clusterfs.sh	21 May 2007 15:56:47 -0000	1.1.2.13
+++ clusterfs.sh	11 Jul 2007 14:12:22 -0000
@@ -354,7 +354,7 @@
 	typeset junk
 
 	if [ $# -ne 2 ]; then
-		logAndPrint $LOG_ERR "Usage: mountInUse device mount_point".
+		ocf_log err "Usage: mountInUse device mount_point".
 		return $FAIL
 	fi
 
@@ -432,14 +432,14 @@
 	declare rw
 	
 	if [ $# -ne 1 ]; then
-	        logAndPrint $LOG_ERR "Usage: isAlive mount_point"
+	        ocf_log err "Usage: isAlive mount_point"
 		return $FAIL
 	fi
 	mount_point=$1
 	
 	test -d $mount_point
 	if [ $? -ne 0 ]; then
-		logAndPrint $LOG_ERR "$mount_point is not a directory"
+		ocf_log err "$mount_point is not a directory"
 		return $FAIL
 	fi
 	
@@ -726,7 +726,7 @@
 	#
 	# Mount the device
 	#
-	logAndPrint $LOG_DEBUG "mount $fstype_option $mount_options $dev $mp"
+	ocf_log debug "mount $fstype_option $mount_options $dev $mp"
 	mount $fstype_option $mount_options $dev $mp
 	ret_val=$?
 	if [ $ret_val -ne 0 ]; then
Index: nfsclient.sh
===================================================================
RCS file: /cvs/cluster/cluster/rgmanager/src/resources/nfsclient.sh,v
retrieving revision 1.3.2.12
diff -u -r1.3.2.12 nfsclient.sh
--- nfsclient.sh	3 May 2007 15:02:47 -0000	1.3.2.12
+++ nfsclient.sh	11 Jul 2007 14:12:22 -0000
@@ -338,13 +338,36 @@
         # 
 	export OCF_RESKEY_target_regexp=$(echo $OCF_RESKEY_target | \
 		sed -e 's/*/[*]/g' -e 's/?/[?]/g' -e 's/\./\\./g') 
-        exportfs -v | tr -d "\n" | sed -e 's/([^)]*)/\n/g' | grep -q \
-		"^${OCF_RESKEY_path}[\t ]*.*${OCF_RESKEY_target_regexp}" 
 
+	declare tmpfn=/tmp/nfsclient-$OCF_RESKEY_name.status.$$
+	exportfs -v > $tmpfn
+        cat $tmpfn | tr -d "\n" | sed -e 's/([^)]*)/\n/g' | grep -iq \
+		"^${OCF_RESKEY_path}[\t ]*.*${OCF_RESKEY_target_regexp}" 
 	rv=$? 
-	if [ $rv -ne 0 ]; then
+
+	if [ $rv -eq 0 ]; then
+		exit 0
+	fi
+
+	declare OCF_RESKEY_target_tmp=$(clufindhostname -i "$OCF_RESKEY_target")
+	if [ $? -ne 0 ]; then
+		rm -f $tmpfn
 		ocf_log err "nfsclient:$OCF_RESKEY_name is missing!"
+		exit 1
 	fi
+
+        cat $tmpfn | tr -d "\n" | sed -e 's/([^)]*)/\n/g' | grep -q \
+		"^${OCF_RESKEY_path}[\t ]*.*${OCF_RESKEY_target_tmp}" 
+	rv=$? 
+
+	rm -f $tmpfn
+
+	if [ $rv -eq 0 ]; then
+		exit 0
+	fi
+
+	ocf_log err "nfsclient:$OCF_RESKEY_name is missing!"
+	exit 1
 	;;
 
 recover)

From lhh at redhat.com  Wed Jul 11 14:22:42 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:22:42 -0400
Subject: [Linux-cluster] rgmanager on debian
In-Reply-To: <1c3801c7c386$8a713f20$0a01a8c0@office.nodex.ru>
References: <1c3101c7c384$bfbcf680$0a01a8c0@office.nodex.ru>
	<1c3801c7c386$8a713f20$0a01a8c0@office.nodex.ru>
Message-ID: <20070711142242.GN18076@redhat.com>

On Wed, Jul 11, 2007 at 10:41:41AM +0400, Pavel D. Kuzin wrote:
> This problem was fixed.

That's good.

> Seems it was because in fstab was record /dev/Hosting/Hosting_data and 
> rgmanager tryes to mount /dev/mapper/Hosting-Hosting_data.

That's strange; fstab entries shouldn't matter.

> Now i have another question.
> How i can start one service on both nodes simultaneously ?

There's no way to start two copies of the same service... but:

* clusterfs can be referenced multiple times, and
* scripts can be referenced multiple times.

(Note: IP addresses can not be referenced multiple times and/or started
on two nodes...)

So, you can build something like:

  <resources>
    <clusterfs name="mygfs" device... mountpoint... options...
      force_unmount="0"/> <!-- force_unmount=0 is important -->
	<script name="myscript" path.../>
  <service name="foo-instance1">
    <clusterfs ref="mygfs"/>
    <script ref="myscript"/>
    <ip address="10.1.1.1"/>
  </service>
  <service name="foo-instance2">
    <clusterfs ref="mygfs"/>
    <script ref="myscript"/>
    <ip address="10.1.1.2"/>
  </service>

The only thing to note is that often times, the application "myscript"
might try to bind to INADDR_ANY - preventing the two services from
running on the same node.  Additionally, most scripts will stomp on each
other during the "stop" phase if stopped on the same node - something to
be aware of.  (You can get around this by storing say
OCF_RESKEY_service_name and using that to control instances individually
in your script...)

Generally, the two services should be able to operate independently -
even if sharing resources

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 11 14:32:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:32:30 -0400
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
Message-ID: <20070711143225.GO18076@redhat.com>

On Wed, Jul 11, 2007 at 08:44:56AM +0000, mehmet celik wrote:
> 
> hi, firstly, the piranha is not supporting lvs/dr or lvs/tun right now. may 
> in future..
> 
> the following paragraph is from
> http://www.redhat.com/support/resources/howto/piranha/

Hmm, I didn't even know that existed.  It's outdated :)

You can do direct routing; it just requires some additional steps on
the real servers.

http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Virtual_Server_Administration/s1-lvs-direct-VSA.html

I'll see about getting the howto updated.

(You can do it in RHCS4 too; I don't know about RHCS3)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 11 14:33:08 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:33:08 -0400
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <6cfbd1b40707110510x69096d47ya03db1834b9217c1@mail.gmail.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
	<6cfbd1b40707110510x69096d47ya03db1834b9217c1@mail.gmail.com>
Message-ID: <20070711143308.GP18076@redhat.com>

On Wed, Jul 11, 2007 at 02:10:14PM +0200, Tomas Hoger wrote:
> maybe it's out of date

It is.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 11 14:38:03 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 10:38:03 -0400
Subject: [Linux-cluster] Redhat cluster and GFS node limit
In-Reply-To: <430395.53669.qm@web32210.mail.mud.yahoo.com>
References: <430395.53669.qm@web32210.mail.mud.yahoo.com>
Message-ID: <20070711143803.GQ18076@redhat.com>

On Wed, Jul 11, 2007 at 03:34:24AM -0700, Hal wrote:
> Hi all,
> 
> I have a project which requires a cluster of over 100 GFS nodes. GFS supports
> them, but RH cluster supports only 16 nodes. For this case the documentation
> says:
> 
> "Red Hat GFS is scalable up to 300 nodes, a Red Hat Cluster Manager limits the
> total number of nodes in a cluster to 16. Therefore, in this scenario, Red Hat
> GFS scalability is limited. If the 16-node limit is too small for your
> deployment, you may want to consider using multiple Red Hat Cluster Manager
> clusters."
> 
> So any docs or ideas on how to build "multiple Red Hat Cluster Manager
> clusters"? 

The documentation is a hair unclear here, I guess.

You can build a 100 node GFS cluster.  In the case of the documentation
above, it is referring to just the failover component (not related to
GFS).

That is, you can...

* build a 100 node cluster
* mount GFS on 100 nodes
* Access files from all 100 nodes
* start the "service manager" ("Cluster Manager") portion on 16

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From kristoffer.lippert at jppol.dk  Wed Jul 11 14:41:32 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Wed, 11 Jul 2007 16:41:32 +0200
Subject: [Linux-cluster] LVS cluster
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>

One fairly simple question for a change. :-)

We have two frontend webservers running a cluster to handle a SAN (gfs).
Would it be possible to run the LVS router on the same physical machines
as the cluster? Thus creating a fully redundant loadbalanced cluster of
our two frontend webservers.

Currently we run with dns RoundRobin "load balancing". (Wich is not so
good in failover situations).

Thanks in advance :-)
Mvh / Kind regards

Kristoffer Lippert
Systemansvarlig
JP/Politiken A/S
Online Magasiner

Tlf. +45 8738 3032
Cell. +45 6062 8703

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070711/1adf2861/attachment.htm>

From tomas.hoger at gmail.com  Wed Jul 11 15:02:43 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Wed, 11 Jul 2007 17:02:43 +0200
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <20070711143225.GO18076@redhat.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
	<20070711143225.GO18076@redhat.com>
Message-ID: <6cfbd1b40707110802y239fa1a0j870cdccbd61063ec@mail.gmail.com>

On 7/11/07, Lon Hohberger <lhh at redhat.com> wrote:
> (You can do it in RHCS4 too; I don't know about RHCS3)

It's also possible to use LVS-DR on RHEL/RHCS3.

Regarding documentation: It only mentions arptables as ARP problem
fix.  For kernel 2.6+, there is also another way using sysctl, which
does not require additional kernel patches / modules.

th.



From tomas.hoger at gmail.com  Wed Jul 11 15:12:49 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Wed, 11 Jul 2007 17:12:49 +0200
Subject: [Linux-cluster] LVS cluster
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
Message-ID: <6cfbd1b40707110812y1c3465d8m2472dc5c062c8dba@mail.gmail.com>

On 7/11/07, Kristoffer Lippert <kristoffer.lippert at jppol.dk> wrote:
> One fairly simple question for a change. :-)
>
> We have two frontend webservers running a cluster to handle a SAN (gfs).
> Would it be possible to run the LVS router on the same physical machines as
> the cluster? Thus creating a fully redundant loadbalanced cluster of our two
> frontend webservers.
>
> Currently we run with dns RoundRobin "load balancing". (Wich is not so good
> in failover situations).

Technically yes, but I'm not sure if it's supported ;).  You should be
able to find some examples in LVS howto or on ultramonkey site.

Should be easy to implement using RHCS/piranha, if you have RHEL4 or 5.

th.



From Danny.Wall at health-first.org  Wed Jul 11 15:15:54 2007
From: Danny.Wall at health-first.org (Danny Wall)
Date: Wed, 11 Jul 2007 11:15:54 -0400
Subject: [Linux-cluster] Backups with GFS and RAM sizing
Message-ID: <4694BBEA020000C800002C86@gwia.health-first.org>

My backups are taking a very long time, and I am looking for ways to
optimize the process, so suggestions are welcome.

The clusters consist of:
2 x IBM x335 Dual Proc 3.6Ghz XEON
8GB RAM

RHEL4U4 running off local 73GB U320 SCSI in hardware mirror. We are
running RHCS with GFS and DLM. Data that is being backed up is on an IBM
SAN. The slow backups are on an IBM DS4800, Fibre Channel 2Gbps. The
DS4800 has 16GB cache, and is not being over utilized.

The data is divided in to ten 2TB chunks, and mounted under /data/T1
through /data/T10. The backup software is IBM Tivoli Storage Manager.
When the backups start, it processes every file to determine if it needs
to be backed up or not. Last night, it took 15 hours to process 6.7
million files, then back up 4200 files (9GB) total. I do not currently
know how long it takes to actually back up 9GB, but standard copies
would be done relatively quick over the gig ethernet. The Tivoli backup
servers caches the data on seperate SAN disks before backing up to tape,
so the slowdown is not there.

>From what I can tell, the slowness is only on the Red Hat servers,
during processing. Comparing this to some AIX servers with large
backups, the AIX servers can scan 12 million files in about 5 hours, and
a Netware server scanned 17.2 million files in 16 hours. The AIX is
difficult to compare, since it is totally different hardware, but the
Netware server is the same model server, with only 1GB RAM, and using
1.5TB on FC SATA on an IBM FastT 100. If the Netware server takes the
same time to scan more files with less RAM and slower disks, why are my
Linux servers so slow. I know Netware has excellent disk I/O, but this
seems to be more of a processing issue. I don't think the content or
size of the files should matter, but according to our backup admin,
Tivoli will check some attributes (file size, date, rights, etc.) to see
if there are changes.

I am looking for backup client optimizations, but would also like to see
what others are doing or can suggest. The CPU is ranging between
80-100%, so I assume it is hitting both processors. If I try manual
copies from this server during backups, a copy that should take 10
seconds takes 10 minutes. I moved the share to the server not performing
backups, but using the same GFS storage locations, and the copy takes 10
seconds, so the SAN does not appear to be a problem. The slowness
appears to be in the file scan stage, to determine what needs to be
backed up. 

Is there any way I can optimize the disk access, RAM or processor that
might benefit? I am considering adding a server to split up the load, so
I could potentially have two servers with two samba shares each, and the
third server could provide failover and backup services. If adding RAM
would help, I am open to that as well. Additional CPUs might help, since
the utilization is 80-100 during backups, but I will have to purchase
new servers and move everything, which is not appealing. 

If I run free:

             total       used       free     shared    buffers
cached
Mem:       4040864    4020716      20148          0      20512
183012
-/+ buffers/cache:    3817192     223672
Swap:      2097144        224    2096920

Since swap is not really being used, I assume the RAM is being used for
file cache, which makes it hard to determine how much RAM is actually
available for processing. Are there any guidelines I can use to help me
properly size the server (specifically RAM), based on the number of
files or size of data? I recently upgraded from 4GB to 8GB, because I
would occasionally run out of memory on the servers. 


Since Tivoli does some comparison of various attributes during
processing, is it possible I am seeing problems related to the clustered
file system(ie. du -sh on /data/T1 takes minutes the first time)? Any
way to speed this up? Are others using snapshot pools or some other
backup method? Thanks






From chawkins at veracitynetworks.com  Wed Jul 11 16:01:25 2007
From: chawkins at veracitynetworks.com (Christopher Hawkins)
Date: Wed, 11 Jul 2007 12:01:25 -0400
Subject: [Linux-cluster] LVS cluster
In-Reply-To: <6cfbd1b40707110812y1c3465d8m2472dc5c062c8dba@mail.gmail.com>
Message-ID: <200707111543.l6BFhBPH029186@mail2.ontariocreditcorp.com>

 > -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tomas Hoger
> Sent: Wednesday, July 11, 2007 11:13 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] LVS cluster
> 
> On 7/11/07, Kristoffer Lippert <kristoffer.lippert at jppol.dk> wrote:
> > One fairly simple question for a change. :-)
> >
> > We have two frontend webservers running a cluster to handle 
> a SAN (gfs).
> > Would it be possible to run the LVS router on the same physical 
> > machines as the cluster? Thus creating a fully redundant 
> loadbalanced 
> > cluster of our two frontend webservers.
> >
> > Currently we run with dns RoundRobin "load balancing". 
> (Wich is not so 
> > good in failover situations).
> 
> Technically yes, but I'm not sure if it's supported ;).  You 
> should be able to find some examples in LVS howto or on 
> ultramonkey site.
> 
> Should be easy to implement using RHCS/piranha, if you have 
> RHEL4 or 5.

I'd suggest looking into keepalived as well. It (or something similar) will
perform service checks and will add / remove nodes from the LVS tables on
the fly if they go online / offline. That should fix your failover issue. 

Chris  



From Christopher.Barry at qlogic.com  Wed Jul 11 16:29:03 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Wed, 11 Jul 2007 12:29:03 -0400
Subject: [Linux-cluster] strange slowness of ls with 1 newly created
	file on	gfs 1 or 2
In-Reply-To: <46943F2A.3080707@redhat.com>
References: <46941A71.7010602@blackhole.sk>  <46943F2A.3080707@redhat.com>
Message-ID: <1184171343.5216.7.camel@localhost>

On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> Pavel Stano wrote:
> 
> >and then run touch on node 1:
> >serpico# touch /d/0/test
> >
> >and ls on node 2:
> >dinorscio:~# time ls /d/0/
> >test
> >
> >  
> >
> 
> What have you expected from a cluster filesystem ? When you touch a file 
> on node 1, it is a "create" that requires at least 2 exclusive locks 
> (directory lock and the file lock itself, among many other things). On a 
> local filesystem such as ext3, disk activities are delayed due to 
> filesystem cache where "touch" writes the data into cache and "ls" reads 
> it from cache on the very same node - all memory operations.  On cluster 
> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
> release the locks (few ping-pong messages between two nodes and lock 
> managers via network), the contents inside node 1's cache need to get 
> synced to the shared storage. After node 2 gets the locks, it  has to 
> read contents from the disk.
> 
> I hope the above explanation is clear.
> 
> >and last thing, i try gfs2, but same result
> >
> >
> >  
> >
> -- Wendy

This seems a little bit odd to me. I'm running a RH 7.3 cluster,
pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
since ~2002.

Here's the timing I get for the same basic test between two nodes:

[root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
[root at sbc1 cbarry]# mkdir tst
[root at sbc1 cbarry]# cd tst
[root at sbc1 tst]# time touch testfile

real    0m0.094s
user    0m0.000s
sys     0m0.000s
[root at sbc1 tst]# time ls -la testfile
-rw-r--r--    1 root     root            0 Jul 11 12:20 testfile

real    0m0.122s
user    0m0.010s
sys     0m0.000s
[root at sbc1 tst]#

Then immediately from the other node:

[root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
[root at sbc2 cbarry]# time ls -la tst
total 12
drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
-rw-r--r--    1 root     root            0 Jul 11 12:20 testfile

real    0m0.088s
user    0m0.010s
sys     0m0.000s
[root at sbc2 cbarry]#


Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
just does not fly. My guess is DLM is causing problems.



-- 
Regards,
-C

Christopher Barry
Systems Engineer, Principal



From grimme at atix.de  Wed Jul 11 16:40:02 2007
From: grimme at atix.de (Marc Grimme)
Date: Wed, 11 Jul 2007 18:40:02 +0200
Subject: [Linux-cluster] LVS cluster
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
Message-ID: <200707111840.03560.grimme@atix.de>

On Wednesday 11 July 2007 16:41:32 Kristoffer Lippert wrote:
> One fairly simple question for a change. :-)
>
> We have two frontend webservers running a cluster to handle a SAN (gfs).
> Would it be possible to run the LVS router on the same physical machines
> as the cluster? Thus creating a fully redundant loadbalanced cluster of
> our two frontend webservers.
>
> Currently we run with dns RoundRobin "load balancing". (Wich is not so
> good in failover situations).
>
> Thanks in advance :-)
> Mvh / Kind regards
>
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
>
> Tlf. +45 8738 3032
> Cell. +45 6062 8703

we've done this with keepalive (also) and some rgmanager skripts. It runs on 
the cluster as ha-service and makes loadbalancing.

Regards Marc.

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From wcheng at redhat.com  Wed Jul 11 17:01:55 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 11 Jul 2007 13:01:55 -0400
Subject: [Linux-cluster] strange slowness of ls with 1 newly created	file
	on	gfs 1 or 2
In-Reply-To: <1184171343.5216.7.camel@localhost>
References: <46941A71.7010602@blackhole.sk> <46943F2A.3080707@redhat.com>
	<1184171343.5216.7.camel@localhost>
Message-ID: <46950D03.4000302@redhat.com>

Christopher Barry wrote:
> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
>   
>> Pavel Stano wrote:
>>
>>     
>>> and then run touch on node 1:
>>> serpico# touch /d/0/test
>>>
>>> and ls on node 2:
>>> dinorscio:~# time ls /d/0/
>>> test
>>>
>>>  
>>>
>>>       
>> What have you expected from a cluster filesystem ? When you touch a file 
>> on node 1, it is a "create" that requires at least 2 exclusive locks 
>> (directory lock and the file lock itself, among many other things). On a 
>> local filesystem such as ext3, disk activities are delayed due to 
>> filesystem cache where "touch" writes the data into cache and "ls" reads 
>> it from cache on the very same node - all memory operations.  On cluster 
>> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
>> release the locks (few ping-pong messages between two nodes and lock 
>> managers via network), the contents inside node 1's cache need to get 
>> synced to the shared storage. After node 2 gets the locks, it  has to 
>> read contents from the disk.
>>
>> I hope the above explanation is clear.
>>
>>     
>>> and last thing, i try gfs2, but same result
>>>
>>>
>>>  
>>>
>>>       
>> -- Wendy
>>     
>
> This seems a little bit odd to me. I'm running a RH 7.3 cluster,
> pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
> since ~2002.
>
> Here's the timing I get for the same basic test between two nodes:
>
> [root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> [root at sbc1 cbarry]# mkdir tst
> [root at sbc1 cbarry]# cd tst
> [root at sbc1 tst]# time touch testfile
>
> real    0m0.094s
> user    0m0.000s
> sys     0m0.000s
> [root at sbc1 tst]# time ls -la testfile
> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
>
> real    0m0.122s
> user    0m0.010s
> sys     0m0.000s
> [root at sbc1 tst]#
>
> Then immediately from the other node:
>
> [root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> [root at sbc2 cbarry]# time ls -la tst
> total 12
> drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
> drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
>
> real    0m0.088s
> user    0m0.010s
> sys     0m0.000s
> [root at sbc2 cbarry]#
>
>
> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> just does not fly. My guess is DLM is causing problems.
>
>   
 From previous post, we really can't tell since the network and disk 
speeds are variables and unknown. However, look at your data:

local "ls" is 0.122s
remote "ls" is 0.088s

I bet the disk flushing happened during first "ls" (and different base 
kernels treat their dirty data flush and IO scheduling differently). I 
can't be convinced that DLM is an issue - unless the experiment has 
collected enough sample that has its statistical significance.

-- Wendy




From lhh at redhat.com  Wed Jul 11 20:36:22 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 11 Jul 2007 16:36:22 -0400
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <6cfbd1b40707110802y239fa1a0j870cdccbd61063ec@mail.gmail.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
	<20070711143225.GO18076@redhat.com>
	<6cfbd1b40707110802y239fa1a0j870cdccbd61063ec@mail.gmail.com>
Message-ID: <20070711203622.GW18076@redhat.com>

On Wed, Jul 11, 2007 at 05:02:43PM +0200, Tomas Hoger wrote:
> On 7/11/07, Lon Hohberger <lhh at redhat.com> wrote:
> >(You can do it in RHCS4 too; I don't know about RHCS3)
> 
> It's also possible to use LVS-DR on RHEL/RHCS3.
> 
> Regarding documentation: It only mentions arptables as ARP problem
> fix.  For kernel 2.6+, there is also another way using sysctl, which
> does not require additional kernel patches / modules.

There is an IPtables one in the docs too, which works on both 2.4.x and
2.6.x IIRC.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From Christopher.Barry at qlogic.com  Wed Jul 11 20:59:09 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Wed, 11 Jul 2007 16:59:09 -0400
Subject: [Linux-cluster] strange slowness of ls with 1 newly
	created	file on	gfs 1 or 2
In-Reply-To: <46950D03.4000302@redhat.com>
References: <46941A71.7010602@blackhole.sk> <46943F2A.3080707@redhat.com>
	<1184171343.5216.7.camel@localhost>  <46950D03.4000302@redhat.com>
Message-ID: <1184187549.12134.38.camel@localhost>

On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
> Christopher Barry wrote:
> > On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> >   
> >> Pavel Stano wrote:
> >>
> >>     
> >>> and then run touch on node 1:
> >>> serpico# touch /d/0/test
> >>>
> >>> and ls on node 2:
> >>> dinorscio:~# time ls /d/0/
> >>> test
> >>>
> >>>  
> >>>
> >>>       
> >> What have you expected from a cluster filesystem ? When you touch a file 
> >> on node 1, it is a "create" that requires at least 2 exclusive locks 
> >> (directory lock and the file lock itself, among many other things). On a 
> >> local filesystem such as ext3, disk activities are delayed due to 
> >> filesystem cache where "touch" writes the data into cache and "ls" reads 
> >> it from cache on the very same node - all memory operations.  On cluster 
> >> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
> >> release the locks (few ping-pong messages between two nodes and lock 
> >> managers via network), the contents inside node 1's cache need to get 
> >> synced to the shared storage. After node 2 gets the locks, it  has to 
> >> read contents from the disk.
> >>
> >> I hope the above explanation is clear.
> >>
> >>     
> >>> and last thing, i try gfs2, but same result
> >>>
> >>>
> >>>  
> >>>
> >>>       
> >> -- Wendy
> >>     
> >
> > This seems a little bit odd to me. I'm running a RH 7.3 cluster,
> > pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
> > since ~2002.
> >
> > Here's the timing I get for the same basic test between two nodes:
> >
> > [root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> > [root at sbc1 cbarry]# mkdir tst
> > [root at sbc1 cbarry]# cd tst
> > [root at sbc1 tst]# time touch testfile
> >
> > real    0m0.094s
> > user    0m0.000s
> > sys     0m0.000s
> > [root at sbc1 tst]# time ls -la testfile
> > -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >
> > real    0m0.122s
> > user    0m0.010s
> > sys     0m0.000s
> > [root at sbc1 tst]#
> >
> > Then immediately from the other node:
> >
> > [root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> > [root at sbc2 cbarry]# time ls -la tst
> > total 12
> > drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
> > drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
> > -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >
> > real    0m0.088s
> > user    0m0.010s
> > sys     0m0.000s
> > [root at sbc2 cbarry]#
> >
> >
> > Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> > just does not fly. My guess is DLM is causing problems.
> >
> >   
>  From previous post, we really can't tell since the network and disk 
> speeds are variables and unknown. However, look at your data:
> 
> local "ls" is 0.122s
> remote "ls" is 0.088s
> 
> I bet the disk flushing happened during first "ls" (and different base 
> kernels treat their dirty data flush and IO scheduling differently). I 
> can't be convinced that DLM is an issue - unless the experiment has 
> collected enough sample that has its statistical significance.
> 
> -- Wendy
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Where is all the time being spent? Certainly, it should not take 10
seconds.

Let me see if I get the series of events correct here, and you can
correct me where I'm wrong.

Node1:
touch is run, and asks (indirectly) for 2 exclusive write locks.
dlm grants the locks.
File is created into cache.
locks are released (now?)

local ls is run, and asks for read lock
dlm grants lock.
reads cache.
returns results to screen
lock is released

Node2:
remote ls is run, and asks for read lock
... what happens here?
I think your saying dlm looks at the lock request, and says I can't give
it to you, because the buffer has not been sync'd to disk yet.
Does node2 wait, and retry asking for the lock after some time period,
and do this in loop? Does the dlm on Node1 request the data be sync'd so
that the requesting Node2 can access the data faster?

If Pavel used dd to create a file, rather than touch, with a size larger
than the buffer, and then used ls on Node2, would this show far better
performance? Is the real issue the corner-case of a 0 byte file being
created?

Basically, I think you're saying that the kernel is keeping the 0 byte
touched file in cache, and GFS and/or dlm cannot help with this
situation. Is that correct?



-- 
Regards,
-C



From hal_bg at yahoo.com  Wed Jul 11 21:36:45 2007
From: hal_bg at yahoo.com (Hal)
Date: Wed, 11 Jul 2007 14:36:45 -0700 (PDT)
Subject: AW: [Linux-cluster] Redhat cluster and GFS node limit
In-Reply-To: <56825E6A1E167E41A80CE0F6F6916CAB0266A7A3@MNTSVCL10E.mmgmuc.de>
Message-ID: <994144.14043.qm@web32201.mail.mud.yahoo.com>

Thanks for the quick answers! But the answers raised few more questions.
Yes I need LVS to balance ~100 nodes which access the same data on GFS.
In this case what modules do I need from the RedHat Cluster? 
Do I need locks (DLM)? I think I do, don't I?
1.CCS?
2.GFS?
3.GFS-kernel?

Anything else?

Regards 
Hal

--- Schaefer Christian <Christian.Schaefer2 at messe-muenchen.de> wrote:

> Dear Hal,
> 
> this limitation means, that you can only run up to 16 rgmanagers in one
> cluster infrastructure. rgmanager thereby is only needed if you build HA
> services like a active/passive MySQL cluster services. If you like to build a
> 100 node webserver farm with external loadbalancing/failover - such as LVS
> and/or keepalived - no rgmanager is needed to be run on that cluster. 
> 
> In this case you can set up a 100 node cluster with shared gfs devices as you
> like. I too recommend you to have a look at
> http://open-sharedroot.org/documentation. The concept demonstarted on that
> page is do build a n node cluster from one filesystem single image which
> keeps administration simple.
> 
> Greetings,
> Christian
> 
> Best Regards, 
> 
> Christian Sch?fer
> Abt. IT-Applications
> Messe M?nchen 
> 
> Tel.: +49 (0) 89 949 21985 
> E-Mail: christian.schaefer2 at messe-muenchen.de 
> WWW: www.messe-muenchen.de 
> 
>  
> 
> > -----Urspr?ngliche Nachricht-----
> > Von: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Hal
> > Gesendet: Mittwoch, 11. Juli 2007 12:34
> > An: linux-cluster at redhat.com
> > Betreff: [Linux-cluster] Redhat cluster and GFS node limit
> > 
> > Hi all,
> > 
> > I have a project which requires a cluster of over 100 GFS 
> > nodes. GFS supports
> > them, but RH cluster supports only 16 nodes. For this case 
> > the documentation
> > says:
> > 
> > "Red Hat GFS is scalable up to 300 nodes, a Red Hat Cluster 
> > Manager limits the
> > total number of nodes in a cluster to 16. Therefore, in this 
> > scenario, Red Hat
> > GFS scalability is limited. If the 16-node limit is too small for your
> > deployment, you may want to consider using multiple Red Hat 
> > Cluster Manager
> > clusters."
> > 
> > So any docs or ideas on how to build "multiple Red Hat Cluster Manager
> > clusters"? 
> > Will this solution enable all 100+ nodes to access the same 
> > File system? 
> > 
> > 
> > Thank you!
> > Hal
> > 
> > 
> >  
> > ______________________________________________________________
> > ______________________
> > Looking for earth-friendly autos? 
> > Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center.
> > http://autos.yahoo.com/green_center/
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



       
____________________________________________________________________________________
Need a vacation? Get great deals
to amazing places on Yahoo! Travel.
http://travel.yahoo.com/



From dennis at demarco.com  Wed Jul 11 22:02:55 2007
From: dennis at demarco.com (dennis at demarco.com)
Date: Wed, 11 Jul 2007 18:02:55 -0400 (EDT)
Subject: [Linux-cluster] NFSCookbook w/ Redhat 5.0 Cluster
Message-ID: <Pine.LNX.4.64.0707111758330.21046@lycaeum.internal.lan>



I'm playing around with the nfscookbook on a test cluster. I'm finding 
some really odd behavior. I can't seem to get the services to 'stick' the 
cluster node with the lowest priority.

When a cluster node starts a services, it seems to re-locate other 
services for no apparent reasons. Have I've done something wrong?

Thanks,
Dennis







Below is my config file:




<?xml version="1.0"?>
<cluster alias="cluster1" config_version="113" name="cluster1">
         <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="12"/>
         <clusternodes>
                 <clusternode name="node03.internal.lan" nodeid="1" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node03" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node01.internal.lan" nodeid="2" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node01" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node02.internal.lan" nodeid="3" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node2" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman/>
         <fencedevices>
                 <fencedevice agent="fence_xvm" name="xen-fence"/>
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="grid1" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="3"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="1"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="2"/>
                         </failoverdomain>
                         <failoverdomain name="grid2" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="2"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="3"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="1"/>
                         </failoverdomain>
                         <failoverdomain name="grid3" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="1"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="2"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="3"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
                         <ip address="192.168.1.23" monitor_link="1"/>
                         <ip address="192.168.1.24" monitor_link="1"/>
                         <ip address="192.168.1.25" monitor_link="1"/>
                         <nfsexport name="nfsexport1"/>
                         <nfsexport name="nfsexport2"/>
                         <nfsexport name="nfsexport3"/>
                         <nfsclient allow_recover="0" name="nfsclient1" 
target="*"/>
                         <nfsclient allow_recover="0" name="nfsclient2" 
target="*"/>
                         <nfsclient allow_recover="0" name="nfsclient3" 
target="*"/>
                         <clusterfs device="/dev/vg0/gfslv2" 
force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" name="
gfs" options="acl"/>
                 </resources>
                 <service autostart="1" domain="grid1" exclusive="0" 
name="nfs1" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport1">
                                         <nfsclient ref="nfsclient1"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.23"/>
                 </service>
                 <service autostart="1" domain="grid2" exclusive="0" 
name="nfs2" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport2">
                                         <nfsclient ref="nfsclient2"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.24"/>
                 </service>
                 <service autostart="1" domain="grid3" exclusive="0" 
name="nfs3" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport3">
                                         <nfsclient ref="nfsclient3"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.25"/>
                 </service>
         </rm>
</cluster>
[root at node01 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="cluster1" config_version="113" name="cluster1">
         <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="12"/>
         <clusternodes>
                 <clusternode name="node03.internal.lan" nodeid="1" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node03" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node01.internal.lan" nodeid="2" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node01" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node02.internal.lan" nodeid="3" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node2" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman/>
         <fencedevices>
                 <fencedevice agent="fence_xvm" name="xen-fence"/>
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="grid1" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="3"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="1"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="2"/>
                         </failoverdomain>
                         <failoverdomain name="grid2" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="2"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="3"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="1"/>
                         </failoverdomain>
                         <failoverdomain name="grid3" ordered="1" 
restricted="1">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="1"/>
                                 <failoverdomainnode 
name="node01.internal.lan" priority="2"/>
                                 <failoverdomainnode 
name="node02.internal.lan" priority="3"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
                         <ip address="192.168.1.23" monitor_link="1"/>
                         <ip address="192.168.1.24" monitor_link="1"/>
                         <ip address="192.168.1.25" monitor_link="1"/>
                         <nfsexport name="nfsexport1"/>
                         <nfsexport name="nfsexport2"/>
                         <nfsexport name="nfsexport3"/>
                         <nfsclient allow_recover="0" name="nfsclient1" 
target="*"/>
                         <nfsclient allow_recover="0" name="nfsclient2" 
target="*"/>
                         <nfsclient allow_recover="0" name="nfsclient3" 
target="*"/>
                         <clusterfs device="/dev/vg0/gfslv2" 
force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" 
name="gfs" options="acl"/>
                 </resources>
                 <service autostart="1" domain="grid1" exclusive="0" 
name="nfs1" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport1">
                                         <nfsclient ref="nfsclient1"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.23"/>
                 </service>
                 <service autostart="1" domain="grid2" exclusive="0" 
name="nfs2" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport2">
                                         <nfsclient ref="nfsclient2"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.24"/>
                 </service>
                 <service autostart="1" domain="grid3" exclusive="0" 
name="nfs3" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport3">
                                         <nfsclient ref="nfsclient3"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.25"/>
                 </service>
         </rm>
</cluster>

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From wcheng at redhat.com  Wed Jul 11 22:03:37 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 11 Jul 2007 18:03:37 -0400
Subject: [Linux-cluster] strange slowness of ls with 1 newly	created	file
	on	gfs 1 or 2
In-Reply-To: <1184187549.12134.38.camel@localhost>
References: <46941A71.7010602@blackhole.sk>
	<46943F2A.3080707@redhat.com>	<1184171343.5216.7.camel@localhost>
	<46950D03.4000302@redhat.com> <1184187549.12134.38.camel@localhost>
Message-ID: <469553B9.5030004@redhat.com>

Christopher Barry wrote:
> On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
>   
>> Christopher Barry wrote:
>>     
>>> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
>>>   
>>>       
>>>> Pavel Stano wrote:
>>>>
>>>>     
>>>>         
>>>>> and then run touch on node 1:
>>>>> serpico# touch /d/0/test
>>>>>
>>>>> and ls on node 2:
>>>>> dinorscio:~# time ls /d/0/
>>>>> test
>>>>>
>>>>>  
>>>>>
>>>>>       
>>>>>           
>>>> What have you expected from a cluster filesystem ? When you touch a file 
>>>> on node 1, it is a "create" that requires at least 2 exclusive locks 
>>>> (directory lock and the file lock itself, among many other things). On a 
>>>> local filesystem such as ext3, disk activities are delayed due to 
>>>> filesystem cache where "touch" writes the data into cache and "ls" reads 
>>>> it from cache on the very same node - all memory operations.  On cluster 
>>>> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
>>>> release the locks (few ping-pong messages between two nodes and lock 
>>>> managers via network), the contents inside node 1's cache need to get 
>>>> synced to the shared storage. After node 2 gets the locks, it  has to 
>>>> read contents from the disk.
>>>>
>>>> I hope the above explanation is clear.
>>>>
>>>>     
>>>>         
>>>>> and last thing, i try gfs2, but same result
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>>       
>>>>>           
>>>> -- Wendy
>>>>     
>>>>         
>>> This seems a little bit odd to me. I'm running a RH 7.3 cluster,
>>> pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
>>> since ~2002.
>>>
>>> Here's the timing I get for the same basic test between two nodes:
>>>
>>> [root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
>>> [root at sbc1 cbarry]# mkdir tst
>>> [root at sbc1 cbarry]# cd tst
>>> [root at sbc1 tst]# time touch testfile
>>>
>>> real    0m0.094s
>>> user    0m0.000s
>>> sys     0m0.000s
>>> [root at sbc1 tst]# time ls -la testfile
>>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
>>>
>>> real    0m0.122s
>>> user    0m0.010s
>>> sys     0m0.000s
>>> [root at sbc1 tst]#
>>>
>>> Then immediately from the other node:
>>>
>>> [root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
>>> [root at sbc2 cbarry]# time ls -la tst
>>> total 12
>>> drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
>>> drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
>>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
>>>
>>> real    0m0.088s
>>> user    0m0.010s
>>> sys     0m0.000s
>>> [root at sbc2 cbarry]#
>>>
>>>
>>> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
>>> just does not fly. My guess is DLM is causing problems.
>>>
>>>   
>>>       
>>  From previous post, we really can't tell since the network and disk 
>> speeds are variables and unknown. However, look at your data:
>>
>> local "ls" is 0.122s
>> remote "ls" is 0.088s
>>
>> I bet the disk flushing happened during first "ls" (and different base 
>> kernels treat their dirty data flush and IO scheduling differently). I 
>> can't be convinced that DLM is an issue - unless the experiment has 
>> collected enough sample that has its statistical significance.
>>
>> -- Wendy
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>   

ok :) I admire your curiosity. I'm not saying 10 seconds is ok. I'm 
saying one single command doesn't imply anything (since there are so 
many variables there). You need to try out few more runs before 
concluding anything is wrong.
> Where is all the time being spent? Certainly, it should not take 10
> seconds.
>
> Let me see if I get the series of events correct here, and you can
> correct me where I'm wrong.
>
> Node1:
> touch is run, and asks (indirectly) for 2 exclusive write locks.
> dlm grants the locks.
> File is created into cache.
> locks are released (now?)
>   
Not necessarily (if there is no other request pending, GFS caches the 
locks assuming next request will be most likely from this node).
> local ls is run, and asks for read lock
> dlm grants lock.
> reads cache.
> returns results to screen
> lock is released
>   
In your case, the lock was downgraded from write to read; file was 
flushed; all within local node before remote "ls" was issued. This is 
different from previous post. Previous poster didn't do an "ls" so he 
paid the price for extra network traffic, plus the synchronization 
(wait) cost (waiting for lock manager to communicate and file sync to 
disk). And remember lock manager is implemented as daemon. You send the 
daemon a message and it may not be waken up in time to receive the 
message . A lot of variables there.
> Node2:
> remote ls is run, and asks for read lock
> ... what happens here?
>   
DLM sends messages (via network) to node 1 to ask for lock. After lock 
is granted, GFS reads the file from the disk.
> I think your saying dlm looks at the lock request, and says I can't give
> it to you, because the buffer has not been sync'd to disk yet.
>   
No, DLM says I need to ask whoever is holding the lock to release the 
lock. And GFS waits until lock is granted. Whoever owns the lock needs 
to do its action accordingly. If it is an exclusive lock, the file needs 
to get flushed before the lock can be shared.
> Does node2 wait, and retry asking for the lock after some time period,
> and do this in loop? Does the dlm on Node1 request the data be sync'd so
> that the requesting Node2 can access the data faster?
>   
It is not in a loop. It is an event-wait-wakeup logic.
> If Pavel used dd to create a file, rather than touch, with a size larger
> than the buffer, and then used ls on Node2, would this show far better
> performance? Is the real issue the corner-case of a 0 byte file being
> created?
>   
No, I don't think so. Not sure how "dd" is implemented internally from 
top of my head. However, remember "create" competes with "ls" for 
directory lock. But a file write itself doesn't compete with "ls" since 
it only requires file lock.  On the other hand, "ls -la" is another 
story - it requires file size so it will need the file (inode locks). So 
there is another variation there.
> Basically, I think you're saying that the kernel is keeping the 0 byte
> touched file in cache, and GFS and/or dlm cannot help with this
> situation. Is that correct?
>
>   
No, I'm not saying that. Again, I'm saying you need to run the command 
few times, instead of one time shot before concluding anything. Since 
there are simply too many variations and variables under-neath these 
simple "touch" and "ls" commands in a cluster environment.

-- Wendy



From Christopher.Barry at qlogic.com  Wed Jul 11 23:44:47 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Wed, 11 Jul 2007 19:44:47 -0400
Subject: [Linux-cluster] strange slowness of ls with 1
	newly	created	file on	gfs 1 or 2
In-Reply-To: <469553B9.5030004@redhat.com>
References: <46941A71.7010602@blackhole.sk> <46943F2A.3080707@redhat.com>
	<1184171343.5216.7.camel@localhost> <46950D03.4000302@redhat.com>
	<1184187549.12134.38.camel@localhost>  <469553B9.5030004@redhat.com>
Message-ID: <1184197488.12134.61.camel@localhost>

On Wed, 2007-07-11 at 18:03 -0400, Wendy Cheng wrote:
> Christopher Barry wrote:
> > On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
> >   
> >> Christopher Barry wrote:
> >>     
> >>> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> >>>   
> >>>       
> >>>> Pavel Stano wrote:
> >>>>
> >>>>     
> >>>>         
> >>>>> and then run touch on node 1:
> >>>>> serpico# touch /d/0/test
> >>>>>
> >>>>> and ls on node 2:
> >>>>> dinorscio:~# time ls /d/0/
> >>>>> test
> >>>>>
> >>>>>  
> >>>>>
> >>>>>       
> >>>>>           
> >>>> What have you expected from a cluster filesystem ? When you touch a file 
> >>>> on node 1, it is a "create" that requires at least 2 exclusive locks 
> >>>> (directory lock and the file lock itself, among many other things). On a 
> >>>> local filesystem such as ext3, disk activities are delayed due to 
> >>>> filesystem cache where "touch" writes the data into cache and "ls" reads 
> >>>> it from cache on the very same node - all memory operations.  On cluster 
> >>>> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to 
> >>>> release the locks (few ping-pong messages between two nodes and lock 
> >>>> managers via network), the contents inside node 1's cache need to get 
> >>>> synced to the shared storage. After node 2 gets the locks, it  has to 
> >>>> read contents from the disk.
> >>>>
> >>>> I hope the above explanation is clear.
> >>>>
> >>>>     
> >>>>         
> >>>>> and last thing, i try gfs2, but same result
> >>>>>
> >>>>>
> >>>>>  
> >>>>>
> >>>>>       
> >>>>>           
> >>>> -- Wendy
> >>>>     
> >>>>         
> >>> This seems a little bit odd to me. I'm running a RH 7.3 cluster,
> >>> pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
> >>> since ~2002.
> >>>
> >>> Here's the timing I get for the same basic test between two nodes:
> >>>
> >>> [root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root at sbc1 cbarry]# mkdir tst
> >>> [root at sbc1 cbarry]# cd tst
> >>> [root at sbc1 tst]# time touch testfile
> >>>
> >>> real    0m0.094s
> >>> user    0m0.000s
> >>> sys     0m0.000s
> >>> [root at sbc1 tst]# time ls -la testfile
> >>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.122s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root at sbc1 tst]#
> >>>
> >>> Then immediately from the other node:
> >>>
> >>> [root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root at sbc2 cbarry]# time ls -la tst
> >>> total 12
> >>> drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
> >>> drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
> >>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.088s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root at sbc2 cbarry]#
> >>>
> >>>
> >>> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> >>> just does not fly. My guess is DLM is causing problems.
> >>>
> >>>   
> >>>       
> >>  From previous post, we really can't tell since the network and disk 
> >> speeds are variables and unknown. However, look at your data:
> >>
> >> local "ls" is 0.122s
> >> remote "ls" is 0.088s
> >>
> >> I bet the disk flushing happened during first "ls" (and different base 
> >> kernels treat their dirty data flush and IO scheduling differently). I 
> >> can't be convinced that DLM is an issue - unless the experiment has 
> >> collected enough sample that has its statistical significance.
> >>
> >> -- Wendy
> >>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>     
> >
> >   
> 
> ok :) I admire your curiosity. I'm not saying 10 seconds is ok. I'm 
> saying one single command doesn't imply anything (since there are so 
> many variables there). You need to try out few more runs before 
> concluding anything is wrong.
> > Where is all the time being spent? Certainly, it should not take 10
> > seconds.
> >
> > Let me see if I get the series of events correct here, and you can
> > correct me where I'm wrong.
> >
> > Node1:
> > touch is run, and asks (indirectly) for 2 exclusive write locks.
> > dlm grants the locks.
> > File is created into cache.
> > locks are released (now?)
> >   
> Not necessarily (if there is no other request pending, GFS caches the 
> locks assuming next request will be most likely from this node).
> > local ls is run, and asks for read lock
> > dlm grants lock.
> > reads cache.
> > returns results to screen
> > lock is released
> >   
> In your case, the lock was downgraded from write to read; file was 
> flushed; all within local node before remote "ls" was issued. This is 
> different from previous post. Previous poster didn't do an "ls" so he 
> paid the price for extra network traffic, plus the synchronization 
> (wait) cost (waiting for lock manager to communicate and file sync to 
> disk). And remember lock manager is implemented as daemon. You send the 
> daemon a message and it may not be waken up in time to receive the 
> message . A lot of variables there.
> > Node2:
> > remote ls is run, and asks for read lock
> > ... what happens here?
> >   
> DLM sends messages (via network) to node 1 to ask for lock. After lock 
> is granted, GFS reads the file from the disk.
> > I think your saying dlm looks at the lock request, and says I can't give
> > it to you, because the buffer has not been sync'd to disk yet.
> >   
> No, DLM says I need to ask whoever is holding the lock to release the 
> lock. And GFS waits until lock is granted. Whoever owns the lock needs 
> to do its action accordingly. If it is an exclusive lock, the file needs 
> to get flushed before the lock can be shared.
> > Does node2 wait, and retry asking for the lock after some time period,
> > and do this in loop? Does the dlm on Node1 request the data be sync'd so
> > that the requesting Node2 can access the data faster?
> >   
> It is not in a loop. It is an event-wait-wakeup logic.
> > If Pavel used dd to create a file, rather than touch, with a size larger
> > than the buffer, and then used ls on Node2, would this show far better
> > performance? Is the real issue the corner-case of a 0 byte file being
> > created?
> >   
> No, I don't think so. Not sure how "dd" is implemented internally from 
> top of my head. However, remember "create" competes with "ls" for 
> directory lock. But a file write itself doesn't compete with "ls" since 
> it only requires file lock.  On the other hand, "ls -la" is another 
> story - it requires file size so it will need the file (inode locks). So 
> there is another variation there.
> > Basically, I think you're saying that the kernel is keeping the 0 byte
> > touched file in cache, and GFS and/or dlm cannot help with this
> > situation. Is that correct?
> >
> >   
> No, I'm not saying that. Again, I'm saying you need to run the command 
> few times, instead of one time shot before concluding anything. Since 
> there are simply too many variations and variables under-neath these 
> simple "touch" and "ls" commands in a cluster environment.
> 
> -- Wendy
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Thank You for the lesson Wendy. ;^)

Another question you'll likely know the answer to. Is there a preferred
IO Scheduler to use with GFS? 


-- 
Regards,
-C



From dennis at demarco.com  Thu Jul 12 01:27:51 2007
From: dennis at demarco.com (dennis at demarco.com)
Date: Wed, 11 Jul 2007 21:27:51 -0400 (EDT)
Subject: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
Message-ID: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>


It's most likely bad ettique to reply to your own post, but I have figured 
out the issue.

If anyone is interested, here is a cluster.conf that seems to work w/ the 
nfscookbook whitepaper method.

It's a three node xen, running nfs services w/ active active 
configuration as a test.

- Dennis










<?xml version="1.0"?>
<cluster alias="cluster1" config_version="122" name="cluster1">
         <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="12"/>
         <clusternodes>
                 <clusternode name="node03.internal.lan" nodeid="1" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node03" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node01.internal.lan" nodeid="2" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node01" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="node02.internal.lan" nodeid="3" 
votes="1">
                         <fence>
                                 <method name="1">
                                         <device domain="node2" 
name="xen-fence"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman/>
         <fencedevices>
                 <fencedevice agent="fence_xvm" name="xen-fence"/>
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="perfer_1" ordered="0" 
restricted="0">
                                 <failoverdomainnode 
name="node01.internal.lan" priority="1"/>
                         </failoverdomain>
                         <failoverdomain name="perfer_2" ordered="0" 
restricted="0">
                                 <failoverdomainnode 
name="node02.internal.lan" priority="1"/>
                         </failoverdomain>
                         <failoverdomain name="perfer_3" ordered="0" 
restricted="0">
                                 <failoverdomainnode 
name="node03.internal.lan" priority="1"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
                         <ip address="192.168.1.23" monitor_link="1"/>
                         <ip address="192.168.1.24" monitor_link="1"/>
                         <ip address="192.168.1.25" monitor_link="1"/>
                         <nfsexport name="nfsexport1"/>
                         <nfsexport name="nfsexport2"/>
                         <nfsexport name="nfsexport3"/>
                         <nfsclient options="rw" name="nfsclient1" 
target="*"/>
                         <nfsclient options="rw" name="nfsclient2" 
target="*"/>
                         <nfsclient options="rw" name="nfsclient3" 
target="*"/>


                         <clusterfs device="/dev/vg0/gfslv2" 
force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" name="
gfs" options="acl"/>
                 </resources>
                 <service autostart="1" domain="perfer_1" exclusive="0" 
name="nfs1" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport1">
                                         <nfsclient ref="nfsclient1"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.23"/>
                 </service>
                 <service autostart="1" domain="perfer_2" exclusive="0" 
name="nfs2" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport2">
                                         <nfsclient ref="nfsclient2"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.24"/>
                 </service>
                 <service autostart="1" domain="perfer_3" exclusive="0" 
name="nfs3" recovery="relocate">
                         <clusterfs ref="gfs">
                                 <nfsexport ref="nfsexport3">
                                         <nfsclient ref="nfsclient3"/>
                                 </nfsexport>
                         </clusterfs>
                         <ip ref="192.168.1.25"/>
                 </service>
         </rm>
</cluster>

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From superjunk at 126.com  Thu Jul 12 01:35:54 2007
From: superjunk at 126.com (Superjunk)
Date: Thu, 12 Jul 2007 09:35:54 +0800
Subject: [Linux-cluster] Clusterng with GFS
References: <465D7C28.4050607@hcl.in>
Message-ID: <002c01c7c424$fe40b9f0$9665a8c0@jklau>

test
----- Original Message ----- 
From: "Daniel Fernanduz" <daniel.j at hcl.in>
To: <linux-cluster at redhat.com>
Sent: Wednesday, May 30, 2007 9:29 PM
Subject: [Linux-cluster] Clusterng with GFS


> 
> Hello
> 
> I am trying to implement a failover cluster with shared storage concept 
> (Postgresql Database 7.4).
> RHEL AS 4.0 + RHCS + GFS
> 
> I am having two gfs volumes which gets mounted by cluster when any of 
> the nodes in
> cluster takes control. The volumes are getting mounted by the cluster 
> whenever a node takes
> control, but not getting unmounted when the node leaves the cluster. 
> Especially when primary
> node takes control, the volumes mounted on the secondary node (Node 
> which gives control to the
> primary node) is not getting unmounted. Anybody plz help me to solve 
> this issue. Thanks for the help in advance
> 
> 
> Regards
> J. DANIEL
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From wcheng at redhat.com  Thu Jul 12 03:45:23 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 11 Jul 2007 23:45:23 -0400
Subject: [Linux-cluster] strange slowness of ls with 1	newly	created	file
	on	gfs 1 or 2
In-Reply-To: <1184197488.12134.61.camel@localhost>
References: <46941A71.7010602@blackhole.sk>
	<46943F2A.3080707@redhat.com>	<1184171343.5216.7.camel@localhost>
	<46950D03.4000302@redhat.com>	<1184187549.12134.38.camel@localhost>
	<469553B9.5030004@redhat.com> <1184197488.12134.61.camel@localhost>
Message-ID: <4695A3D3.4060904@redhat.com>

Christopher Barry wrote:

>
>Another question you'll likely know the answer to. Is there a preferred
>IO Scheduler to use with GFS? 
>
>  
>
Not I know of  - it is very much application and/or benchmark dependent.

-- Wendy



From fait at anl.gov  Thu Jul 12 08:03:46 2007
From: fait at anl.gov (James Fait)
Date: Thu, 12 Jul 2007 03:03:46 -0500
Subject: [Linux-cluster] details of fencing
Message-ID: <1184227426.24028.49.camel@com30.ser.aps.anl.gov>

I am in the process of implementing clustering for shared data storage
across a number of nodes, with several nodes exporting large GNBD
volumes, and also new storage from an iSCSI raid chassis with 6TB of
storage.  The nature of the application requires that the nodes that
access the data store are pretty much independent of each other, just
providing CPU and graphics support while reading several hundred
megabytes of image data in 32mb chunks, and writing numerous small
summary files of this data.  Our current methodology, which works but is
slow, is to server the data by NFS over gigabit ethernet. A similar
facility nearby, with the same application, has implemented GFS on FC
equipment, and are using the FC switch for fencing.  As I have somewhat
different storage hardware and data retention requirements, I need to
implement different fencing methods.

The storage network is on a 3com switch, which is able to take down a
given link via a telnet command, and later restore it.  Also, each of
the storage nodes has a Smart UPS with control over the individual
outlets on the UPS, which could be used for power fencing of the GNBD
server nodes.  The only issue there is that these are not networked UPS
systems, but are connected via serial ports to some of the nodes.  On
the network switch fencing, I am currently using the storage net for
cluster communications, so bringing down a port also stops cluster
communications.

Each of the Storage systems has at least two network interfaces (most
have 6 or more), one (or more) on the storage net, and one on our
intranet. The data processing units have two net interfaces, one on each
network.

I know I will probably have to write a fence agent for at least some of
the parts of this.  The questions that I have are the exact sequence of
events for fencing a node, as in who initiates the fencing operation,
and what is the sequence of events for recovery and rejoining the
cluster after a reboot.  I currently have a test setup of four nodes
with a 4TB GNBD export from one of the nodes to the other three, using
fence_gnbd on those nodes, and fence_gnbd_notify with fence_manual on
the server, at least until I can get the UPS fence agent working.  If I
need to, I can put the UPS systems on a network terminal server to allow
any node to connect to the UPS for commands, but would prefer that it
connect to one of the cluster nodes directly using the serial port.  For
the iSCSI chassis, from the manual it appears that I can force a iSCSI
disconnect via snmp or telnet using the management interface for the
chassis, which from my reading of the RFQ, should be an effective fence
for iSCSI, as it will invalidate the current connection from the
initiator, and requires a re-authentication and negotiation of the link
before allowing more communications with that node.

Hopefully, this gives enough information to a least get a start on this,
as it is several issues, each which may need separate followup.

Sincerely

James Fait
-- 
James Fait, Ph.D.
Beamline Scientist, SER-CAT
APS, building 436B-008
Argonne National Laboratory
9700 S Cass Ave
Argonne, IL 60439
phone 630-252-0644
fax   630-252-0652
email fait at anl.gov



From kristoffer.lippert at jppol.dk  Thu Jul 12 08:31:18 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Thu, 12 Jul 2007 10:31:18 +0200
Subject: SV: [Linux-cluster] LVS cluster
In-Reply-To: <200707111840.03560.grimme@atix.de>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
	<200707111840.03560.grimme@atix.de>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7AAB@exchsrv07.rootdom.dk>

That sounds interesting.

So you have a std. ha cluster running the keepalived as a failover cluster service, and the webservers running outside the cluster (on the same machines). Then the keepalived does the loadbalancing. Is that correctly understood?

Will this take care of session persistence?


Kind regards
Kristoffer

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Marc Grimme
Sendt: 11. juli 2007 18:40
Til: linux clustering
Emne: Re: [Linux-cluster] LVS cluster

On Wednesday 11 July 2007 16:41:32 Kristoffer Lippert wrote:
> One fairly simple question for a change. :-)
>
> We have two frontend webservers running a cluster to handle a SAN (gfs).
> Would it be possible to run the LVS router on the same physical 
> machines as the cluster? Thus creating a fully redundant loadbalanced 
> cluster of our two frontend webservers.
>
> Currently we run with dns RoundRobin "load balancing". (Wich is not so 
> good in failover situations).
>
> Thanks in advance :-)
> Mvh / Kind regards
>
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
>
> Tlf. +45 8738 3032
> Cell. +45 6062 8703

we've done this with keepalive (also) and some rgmanager skripts. It runs on the cluster as ha-service and makes loadbalancing.

Regards Marc.

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From bsd_daemon at msn.com  Thu Jul 12 08:44:27 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Thu, 12 Jul 2007 08:44:27 +0000
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <20070711203622.GW18076@redhat.com>
Message-ID: <BAY105-F21B705289E8E195B1ED8C7E3FC0@phx.gbl>


hi Lon, l have LVS/DR on my network. I don't use the piranha for LVS/DR. But 
I propose piranha for using LVS/NAT. Additionally, if I want to  do network 
tunning, why  do I use  the piranha ???
for this, I use my knowledge on linux..

have a nice day..

Mehmet CELIK
Istanbul/TURKEY

>From: Lon Hohberger <lhh at redhat.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] doubts in Piranha
>Date: Wed, 11 Jul 2007 16:36:22 -0400
>
>On Wed, Jul 11, 2007 at 05:02:43PM +0200, Tomas Hoger wrote:
> > On 7/11/07, Lon Hohberger <lhh at redhat.com> wrote:
> > >(You can do it in RHCS4 too; I don't know about RHCS3)
> >
> > It's also possible to use LVS-DR on RHEL/RHCS3.
> >
> > Regarding documentation: It only mentions arptables as ARP problem
> > fix.  For kernel 2.6+, there is also another way using sysctl, which
> > does not require additional kernel patches / modules.
>
>There is an IPtables one in the docs too, which works on both 2.4.x and
>2.6.x IIRC.
>
>--
>Lon Hohberger - Software Engineer - Red Hat, Inc.
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://newlivehotmail.com



From grimme at atix.de  Thu Jul 12 09:14:51 2007
From: grimme at atix.de (Marc Grimme)
Date: Thu, 12 Jul 2007 11:14:51 +0200
Subject: SV: [Linux-cluster] LVS cluster
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7AAB@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A9D@exchsrv07.rootdom.dk>
	<200707111840.03560.grimme@atix.de>
	<00B9BFA1C44A674794C9A1A4F5A22CA51A7AAB@exchsrv07.rootdom.dk>
Message-ID: <200707121114.52266.grimme@atix.de>

On Thursday 12 July 2007 10:31:18 Kristoffer Lippert wrote:
> That sounds interesting.
>
> So you have a std. ha cluster running the keepalived as a failover cluster
> service, and the webservers running outside the cluster (on the same
> machines). Then the keepalived does the loadbalancing. Is that correctly
> understood?
I suppose so. Keepalived runs as part of the rgmanager and your applications 
can ran on any node. The vip is a clusterip of rgmanager, going to be failed 
over with keepalived. (There also need to be some other tweeks within the 
kernel - iptables/.. keepalived without vrrp). And the application to be 
loadbalanced runs on all the nodes.
>
> Will this take care of session persistence?
Yes I think that is done via ipvs.
>
>
> Kind regards
> Kristoffer
Regards Marc.
>
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Marc Grimme Sendt:
> 11. juli 2007 18:40
> Til: linux clustering
> Emne: Re: [Linux-cluster] LVS cluster
>
> On Wednesday 11 July 2007 16:41:32 Kristoffer Lippert wrote:
> > One fairly simple question for a change. :-)
> >
> > We have two frontend webservers running a cluster to handle a SAN (gfs).
> > Would it be possible to run the LVS router on the same physical
> > machines as the cluster? Thus creating a fully redundant loadbalanced
> > cluster of our two frontend webservers.
> >
> > Currently we run with dns RoundRobin "load balancing". (Wich is not so
> > good in failover situations).
> >
> > Thanks in advance :-)
> > Mvh / Kind regards
> >
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> >
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
>
> we've done this with keepalive (also) and some rgmanager skripts. It runs
> on the cluster as ha-service and makes loadbalancing.
>
> Regards Marc.
>
> --
> Gruss / Regards,
>
> Marc Grimme
> Phone: +49-89 452 3538-14
> http://www.atix.de/               http://www.open-sharedroot.org/
>
> **
> ATIX - Ges. fuer Informationstechnologie und Consulting mbH Einsteinstr. 10
> - 85716 Unterschleissheim - Germany
>
> Registergericht: Amtsgericht M?nchen
> Registernummer: HRB 131682
> USt.-Id.: DE209485962
>
> Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From m.catanese at kinetikon.com  Thu Jul 12 11:50:34 2007
From: m.catanese at kinetikon.com (Matteo Catanese)
Date: Thu, 12 Jul 2007 13:50:34 +0200
Subject: [Linux-cluster] fence problem on rhel5
Message-ID: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>

Hi, all
im trying to do a HA firewall.

i have 2 machines with RHEL5 + CS

I've configured the cluster with Conga.

I dont have any shared storage, I just rsync by hand any change from  
master machine to slave machine.

Not having shared storage implies no need to use and configure fence  
agent.

But if i cut off power to the master server, then the slave starts  
trying to fence the master and since no fence is configured, fence  
fails forever

Jul 12 12:53:12 fireone fenced[2794]: fencing node "firetwo.i3p.it"
Jul 12 12:53:12 fireone fenced[2794]: fence "firetwo.i3p.it" failed
Jul 12 12:53:17 fireone fenced[2794]: fencing node "firetwo.i3p.it"
Jul 12 12:53:17 fireone fenced[2794]: fence "firetwo.i3p.it" failed
Jul 12 12:53:22 fireone fenced[2794]: fencing node "firetwo.i3p.it"
Jul 12 12:53:22 fireone fenced[2794]: fence "firetwo.i3p.it" failed



I can use fence_manual with a script that always returns succees but  
i dont think this is a clean solution


Regards


Matteo Catanese


  



From lpleiman at redhat.com  Thu Jul 12 12:05:57 2007
From: lpleiman at redhat.com (Leo Pleiman)
Date: Thu, 12 Jul 2007 08:05:57 -0400
Subject: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
In-Reply-To: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>
References: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>
Message-ID: <46961925.1010201@redhat.com>

Dennis,

Sorry I missed your original post. Can you send/resend the location of 
the nfs cookbook white paper?

Thanks!

dennis at demarco.com wrote:
>
> It's most likely bad ettique to reply to your own post, but I have 
> figured out the issue.
>
> If anyone is interested, here is a cluster.conf that seems to work w/ 
> the nfscookbook whitepaper method.
>
> It's a three node xen, running nfs services w/ active active 
> configuration as a test.
>
> - Dennis
>
>
>
>
>
>
>
>
>
>
> <?xml version="1.0"?>
> <cluster alias="cluster1" config_version="122" name="cluster1">
>         <fence_daemon clean_start="0" post_fail_delay="0" 
> post_join_delay="12"/>
>         <clusternodes>
>                 <clusternode name="node03.internal.lan" nodeid="1" 
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device domain="node03" 
> name="xen-fence"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node01.internal.lan" nodeid="2" 
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device domain="node01" 
> name="xen-fence"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node02.internal.lan" nodeid="3" 
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device domain="node2" 
> name="xen-fence"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman/>
>         <fencedevices>
>                 <fencedevice agent="fence_xvm" name="xen-fence"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="perfer_1" ordered="0" 
> restricted="0">
>                                 <failoverdomainnode 
> name="node01.internal.lan" priority="1"/>
>                         </failoverdomain>
>                         <failoverdomain name="perfer_2" ordered="0" 
> restricted="0">
>                                 <failoverdomainnode 
> name="node02.internal.lan" priority="1"/>
>                         </failoverdomain>
>                         <failoverdomain name="perfer_3" ordered="0" 
> restricted="0">
>                                 <failoverdomainnode 
> name="node03.internal.lan" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <ip address="192.168.1.23" monitor_link="1"/>
>                         <ip address="192.168.1.24" monitor_link="1"/>
>                         <ip address="192.168.1.25" monitor_link="1"/>
>                         <nfsexport name="nfsexport1"/>
>                         <nfsexport name="nfsexport2"/>
>                         <nfsexport name="nfsexport3"/>
>                         <nfsclient options="rw" name="nfsclient1" 
> target="*"/>
>                         <nfsclient options="rw" name="nfsclient2" 
> target="*"/>
>                         <nfsclient options="rw" name="nfsclient3" 
> target="*"/>
>
>
>                         <clusterfs device="/dev/vg0/gfslv2" 
> force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" name="
> gfs" options="acl"/>
>                 </resources>
>                 <service autostart="1" domain="perfer_1" exclusive="0" 
> name="nfs1" recovery="relocate">
>                         <clusterfs ref="gfs">
>                                 <nfsexport ref="nfsexport1">
>                                         <nfsclient ref="nfsclient1"/>
>                                 </nfsexport>
>                         </clusterfs>
>                         <ip ref="192.168.1.23"/>
>                 </service>
>                 <service autostart="1" domain="perfer_2" exclusive="0" 
> name="nfs2" recovery="relocate">
>                         <clusterfs ref="gfs">
>                                 <nfsexport ref="nfsexport2">
>                                         <nfsclient ref="nfsclient2"/>
>                                 </nfsexport>
>                         </clusterfs>
>                         <ip ref="192.168.1.24"/>
>                 </service>
>                 <service autostart="1" domain="perfer_3" exclusive="0" 
> name="nfs3" recovery="relocate">
>                         <clusterfs ref="gfs">
>                                 <nfsexport ref="nfsexport3">
>                                         <nfsclient ref="nfsclient3"/>
>                                 </nfsexport>
>                         </clusterfs>
>                         <ip ref="192.168.1.25"/>
>                 </service>
>         </rm>
> </cluster>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lpleiman.vcf
Type: text/x-vcard
Size: 194 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070712/2a06ccc8/attachment.vcf>

From dennis at demarco.com  Thu Jul 12 12:50:12 2007
From: dennis at demarco.com (dennis at demarco.com)
Date: Thu, 12 Jul 2007 08:50:12 -0400 (EDT)
Subject: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
In-Reply-To: <46961925.1010201@redhat.com>
References: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>
	<46961925.1010201@redhat.com>
Message-ID: <Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>


It's listed on the cluster FAQ. It's last updated in March. It's a bit 
hard to follow and not correct in some spots, but good starting point.

- Dennis



http://sources.redhat.com/cluster/doc/nfscookbook.pdf - The Unofficial 
NFS/GFS Cookbook.


On Thu, 12 Jul 2007, Leo Pleiman wrote:

> Dennis,
>
> Sorry I missed your original post. Can you send/resend the location of the 
> nfs cookbook white paper?
>
> Thanks!
>
> dennis at demarco.com wrote:
>> 
>> It's most likely bad ettique to reply to your own post, but I have figured 
>> out the issue.
>> 
>> If anyone is interested, here is a cluster.conf that seems to work w/ the 
>> nfscookbook whitepaper method.
>> 
>> It's a three node xen, running nfs services w/ active active configuration 
>> as a test.
>> 
>> - Dennis
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> <?xml version="1.0"?>
>> <cluster alias="cluster1" config_version="122" name="cluster1">
>>         <fence_daemon clean_start="0" post_fail_delay="0" 
>> post_join_delay="12"/>
>>         <clusternodes>
>>                 <clusternode name="node03.internal.lan" nodeid="1" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node03" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node01.internal.lan" nodeid="2" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node01" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node02.internal.lan" nodeid="3" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node2" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>         </clusternodes>
>>         <cman/>
>>         <fencedevices>
>>                 <fencedevice agent="fence_xvm" name="xen-fence"/>
>>         </fencedevices>
>>         <rm>
>>                 <failoverdomains>
>>                         <failoverdomain name="perfer_1" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node01.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                         <failoverdomain name="perfer_2" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node02.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                         <failoverdomain name="perfer_3" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node03.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <resources>
>>                         <ip address="192.168.1.23" monitor_link="1"/>
>>                         <ip address="192.168.1.24" monitor_link="1"/>
>>                         <ip address="192.168.1.25" monitor_link="1"/>
>>                         <nfsexport name="nfsexport1"/>
>>                         <nfsexport name="nfsexport2"/>
>>                         <nfsexport name="nfsexport3"/>
>>                         <nfsclient options="rw" name="nfsclient1" 
>> target="*"/>
>>                         <nfsclient options="rw" name="nfsclient2" 
>> target="*"/>
>>                         <nfsclient options="rw" name="nfsclient3" 
>> target="*"/>
>> 
>>
>>                         <clusterfs device="/dev/vg0/gfslv2" 
>> force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" name="
>> gfs" options="acl"/>
>>                 </resources>
>>                 <service autostart="1" domain="perfer_1" exclusive="0" 
>> name="nfs1" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport1">
>>                                         <nfsclient ref="nfsclient1"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.23"/>
>>                 </service>
>>                 <service autostart="1" domain="perfer_2" exclusive="0" 
>> name="nfs2" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport2">
>>                                         <nfsclient ref="nfsclient2"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.24"/>
>>                 </service>
>>                 <service autostart="1" domain="perfer_3" exclusive="0" 
>> name="nfs3" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport3">
>>                                         <nfsclient ref="nfsclient3"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.25"/>
>>                 </service>
>>         </rm>
>> </cluster>
>> 
>
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>
>

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From kristoffer.lippert at jppol.dk  Thu Jul 12 13:11:46 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Thu, 12 Jul 2007 15:11:46 +0200
Subject: SV: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
In-Reply-To: <Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>
References: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan><46961925.1010201@redhat.com>
	<Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7ABA@exchsrv07.rootdom.dk>

Yeps. I've used it too. It's a really nice place to start if you're trying to find your way around cluster setup for the first time. 

Regards
Kristoffer
 

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af dennis at demarco.com
Sendt: 12. juli 2007 14:50
Til: linux clustering
Emne: Re: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster


It's listed on the cluster FAQ. It's last updated in March. It's a bit hard to follow and not correct in some spots, but good starting point.

- Dennis



http://sources.redhat.com/cluster/doc/nfscookbook.pdf - The Unofficial NFS/GFS Cookbook.


On Thu, 12 Jul 2007, Leo Pleiman wrote:

> Dennis,
>
> Sorry I missed your original post. Can you send/resend the location of the 
> nfs cookbook white paper?
>
> Thanks!
>
> dennis at demarco.com wrote:
>> 
>> It's most likely bad ettique to reply to your own post, but I have figured 
>> out the issue.
>> 
>> If anyone is interested, here is a cluster.conf that seems to work w/ the 
>> nfscookbook whitepaper method.
>> 
>> It's a three node xen, running nfs services w/ active active configuration 
>> as a test.
>> 
>> - Dennis
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> <?xml version="1.0"?>
>> <cluster alias="cluster1" config_version="122" name="cluster1">
>>         <fence_daemon clean_start="0" post_fail_delay="0" 
>> post_join_delay="12"/>
>>         <clusternodes>
>>                 <clusternode name="node03.internal.lan" nodeid="1" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node03" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node01.internal.lan" nodeid="2" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node01" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node02.internal.lan" nodeid="3" 
>> votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device domain="node2" 
>> name="xen-fence"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>         </clusternodes>
>>         <cman/>
>>         <fencedevices>
>>                 <fencedevice agent="fence_xvm" name="xen-fence"/>
>>         </fencedevices>
>>         <rm>
>>                 <failoverdomains>
>>                         <failoverdomain name="perfer_1" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node01.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                         <failoverdomain name="perfer_2" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node02.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                         <failoverdomain name="perfer_3" ordered="0" 
>> restricted="0">
>>                                 <failoverdomainnode 
>> name="node03.internal.lan" priority="1"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <resources>
>>                         <ip address="192.168.1.23" monitor_link="1"/>
>>                         <ip address="192.168.1.24" monitor_link="1"/>
>>                         <ip address="192.168.1.25" monitor_link="1"/>
>>                         <nfsexport name="nfsexport1"/>
>>                         <nfsexport name="nfsexport2"/>
>>                         <nfsexport name="nfsexport3"/>
>>                         <nfsclient options="rw" name="nfsclient1" 
>> target="*"/>
>>                         <nfsclient options="rw" name="nfsclient2" 
>> target="*"/>
>>                         <nfsclient options="rw" name="nfsclient3" 
>> target="*"/>
>> 
>>
>>                         <clusterfs device="/dev/vg0/gfslv2" 
>> force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfsdata" name="
>> gfs" options="acl"/>
>>                 </resources>
>>                 <service autostart="1" domain="perfer_1" exclusive="0" 
>> name="nfs1" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport1">
>>                                         <nfsclient ref="nfsclient1"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.23"/>
>>                 </service>
>>                 <service autostart="1" domain="perfer_2" exclusive="0" 
>> name="nfs2" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport2">
>>                                         <nfsclient ref="nfsclient2"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.24"/>
>>                 </service>
>>                 <service autostart="1" domain="perfer_3" exclusive="0" 
>> name="nfs3" recovery="relocate">
>>                         <clusterfs ref="gfs">
>>                                 <nfsexport ref="nfsexport3">
>>                                         <nfsclient ref="nfsclient3"/>
>>                                 </nfsexport>
>>                         </clusterfs>
>>                         <ip ref="192.168.1.25"/>
>>                 </service>
>>         </rm>
>> </cluster>
>> 
>
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>
>

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From catalin.lupescu at bull.net  Thu Jul 12 13:13:17 2007
From: catalin.lupescu at bull.net (catalin.lupescu at bull.net)
Date: Thu, 12 Jul 2007 15:13:17 +0200
Subject: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter
 machines: 1st machine is
 remove physically. The remaining one does not became Active (waiting for
 fenced)
Message-ID: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>

Hello!

I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
chassis.
(fenced version 1.32.6)

I have done the following test: 
I have removed physically the node 1 machine (the Active one).
The second one is never became active one. "Clustat" command does not 
printing any information.
In /var/log/messages we can found the following messages (repeated):

Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" reports: 
pattern match timed-out at /sbin/fence_bladecenter line 185
Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed

If the node 1 is plugged, the node 2 became Active (fenced OK) 


Here is the configuration of the fenced (cluster.conf)

        <fence_daemon clean_start="1" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="cdrc1-1" votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="3" name="NBC1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="cdrc1-2" votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="5" name="NBC1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_bladecenter" 
ipaddr="172.18.10.63" login="USERID" name="NBC1" passwd="PASSW0RD"/>
        </fencedevices>*


It is a configuration possible to avoid this? 
If not, it is possible to deactivate the fenced?

Best regards!
Catalin LUPESCU
VoIP & Signalling Support

Bull, Architect of an Open World TM 
T?l : 01.30.80.75.43
http://www.bull.com 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070712/a62af89a/attachment.htm>

From rhurst at bidmc.harvard.edu  Thu Jul 12 13:50:34 2007
From: rhurst at bidmc.harvard.edu (Robert Hurst)
Date: Thu, 12 Jul 2007 09:50:34 -0400
Subject: [Linux-cluster] Problem with fenced on cluster with 2
	BladeCenter machines: 1st machine is remove physically. The remaining
	one does not became Active (waiting for fenced)
In-Reply-To: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
Message-ID: <1184248234.4533.6.camel@xw9300.bidmc.harvard.edu>

On a related note, what is the correct value for clean_start in
cluster.conf?

<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>

The man page states it should be set to zero "0"... you have it set to
"1", which intuitively, makes more sense.

       To disable fencing at domain-creation time entirely, the -c
option can
       be  used  to  declare  that  all nodes are in a clean or safe
state to
       start.  The clean_start cluster.conf option can  also  be  set
to  do
       this,  but automatically disabling startup fencing in
cluster.conf can
       risk file system corruption.

       Clean-start  is  used  to prevent any startup fencing the daemon
might
       do.  It indicates that the daemon should assume all  nodes  are
in  a
       clean state to start.

         <fence_daemon clean_start="0">
         </fence_daemon>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070712/7104ccd4/attachment.htm>

From rpeterso at redhat.com  Thu Jul 12 14:26:15 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 12 Jul 2007 09:26:15 -0500
Subject: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
In-Reply-To: <Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>
References: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>
	<46961925.1010201@redhat.com>
	<Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>
Message-ID: <1184250375.11507.271.camel@technetium.msp.redhat.com>

On Thu, 2007-07-12 at 08:50 -0400, dennis at demarco.com wrote:
> It's listed on the cluster FAQ. It's last updated in March. It's a bit 
> hard to follow and not correct in some spots, but good starting point.
> 
> - Dennis

> http://sources.redhat.com/cluster/doc/nfscookbook.pdf - The Unofficial 
> NFS/GFS Cookbook.

Hi Dennis,

At the very bottom of the FAQ, there's a "Links" section that has does
in fact have a link to the NFS/GFS cookbook.

Are you talking about the FAQ or the Cookbook being hard to follow and
not correct in some spots?  If there are problems, inaccuracies or points
of confusion, I'd love to hear about them and I can change them in a 
heartbeat (time permitting, of course).  That's an advantage to them not
being "Official Red Hat" documents.  I'd gladly spend time on fixing them
rather than perpetuate bad information that frustrates everyone.  :)

Regards,

Bob Peterson
Red Hat Cluster Suite




From catalin.lupescu at bull.net  Thu Jul 12 14:51:17 2007
From: catalin.lupescu at bull.net (catalin.lupescu at bull.net)
Date: Thu, 12 Jul 2007 16:51:17 +0200
Subject: =?iso-8859-1?Q?R=E9f=2E_=3A_Re=3A_[Linux-cluster]_Problem_with_fenced_on?=
	cluster with 2	BladeCenter machines: 1st machine is remove physically.
	The remaining	one does not became Active (waiting for fenced)
Message-ID: <OF2D0024B4.52517A4C-ONC1257316.00518551@frcl.bull.fr>

>From Catalin LUPESCU

Explications for the 1 value of this parameter (clean_start="1")

I have continued the initial test (with the clean_start="0"): the node 2 still not working. I have reboot the node2 (with node 1 still 
unplug). 
After the boot, the node 2 still not became Active. 
(fenced is started in loop after the boot)

With the value "1", the fenced is not started at boot because the node 2 
is the first (and the only) machine in the cluster. The machine became 
Active. The fenced is used only after the complete boot of the node 1.
I have done the test with the value "1" and is OK for this part. But the 
original Pb is still there? 

Cordialement,
Catalin LUPESCU
VoIP & Signalling Support

Bull, Architect of an Open World TM 
T?l : 01.30.80.75.43
http://www.bull.com 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070712/cdd9979f/attachment.htm>

From jparsons at redhat.com  Thu Jul 12 15:02:33 2007
From: jparsons at redhat.com (James Parsons)
Date: Thu, 12 Jul 2007 11:02:33 -0400
Subject: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter
	machines: 1st machine is remove physically. The remaining one does
	not became Active (waiting for fenced)
In-Reply-To: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
Message-ID: <46964289.3040104@redhat.com>

catalin.lupescu at bull.net wrote:

>
> Hello!
>
> I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
> chassis.
> (fenced version 1.32.6)
>
> I have done the following test:
> I have removed physically the node 1 machine (the Active one).
> The second one is never became active one. "Clustat" command does not 
> printing any information.
> In /var/log/messages we can found the following messages (repeated):
>
> Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
> reports: pattern match timed-out at /sbin/fence_bladecenter line 185
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
>
> If the node 1 is plugged, the node 2 became Active (fenced OK)
>
bz#240509 changed the sleep timeout in the bladecenter agent from 5 to 
10...this is on or about line 193 in /sbin/fence_bladecenter.  See what 
yours is set at, and try pushing it out a bit. This minor change is 
making its way through the distribution chain now.

-j



From Sthistle at gov.nl.ca  Thu Jul 12 15:13:52 2007
From: Sthistle at gov.nl.ca (Thistle, Scott)
Date: Thu, 12 Jul 2007 12:43:52 -0230
Subject: [Linux-cluster] Problem with fenced on cluster with 2
	BladeCentermachines: 1st machine is remove physically. The
	remaining one doesnot became Active (waiting for fenced)
In-Reply-To: <46964289.3040104@redhat.com>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
	<46964289.3040104@redhat.com>
Message-ID: <506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>

I am having the same issue. If a blade is not present (i.e. removed for
maintenance), the fence_bladecenter cannot check the state as it is
reported empty. I think it is something simple to fix for those versed
in perl. Normally the fence only runs against a blade that is present.
If the blade is removed while running, you run into this issue.

My case below. Blade #3 is a good node. Blade #2 was removed. The fence
does not work with the blade removed.

system> env -T system:blade[3]
OK
system:blade[3]> power -state
On
system:blade[3]> env -T system:blade[2]
The target bay is empty. 
system:blade[3]> env -T system:blade[1]
OK
system:blade[1]>

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 12:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] Problem with fenced on cluster with 2
BladeCentermachines: 1st machine is remove physically. The remaining one
doesnot became Active (waiting for fenced)

catalin.lupescu at bull.net wrote:

>
> Hello!
>
> I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
> chassis.
> (fenced version 1.32.6)
>
> I have done the following test:
> I have removed physically the node 1 machine (the Active one).
> The second one is never became active one. "Clustat" command does not 
> printing any information.
> In /var/log/messages we can found the following messages (repeated):
>
> Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
> reports: pattern match timed-out at /sbin/fence_bladecenter line 185 
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
>
> If the node 1 is plugged, the node 2 became Active (fenced OK)
>
bz#240509 changed the sleep timeout in the bladecenter agent from 5 to
10...this is on or about line 193 in /sbin/fence_bladecenter.  See what
yours is set at, and try pushing it out a bit. This minor change is
making its way through the distribution chain now.

-j

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Thu Jul 12 15:43:31 2007
From: jparsons at redhat.com (James Parsons)
Date: Thu, 12 Jul 2007 11:43:31 -0400
Subject: [Linux-cluster] Problem with fenced on cluster with
	2	BladeCentermachines:
	1st machine is remove physically. The	remaining one doesnot became
	Active (waiting for fenced)
In-Reply-To: <506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>	<46964289.3040104@redhat.com>
	<506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>
Message-ID: <46964C23.8050809@redhat.com>

Thistle, Scott wrote:

>I am having the same issue. If a blade is not present (i.e. removed for
>maintenance), the fence_bladecenter cannot check the state as it is
>reported empty. I think it is something simple to fix for those versed
>in perl. Normally the fence only runs against a blade that is present.
>If the blade is removed while running, you run into this issue.
>
I believe this is what you want to happen...if state cannot be checked, 
fenced keeps trying. How could you determine it was safe to stop without 
persisting some value like the number of fence tries, and trying to 
reason out whether it was safe to stop? This will not happen if you 
remove the blade from the cluster before physically removing it. It is a 
snap to do this  with one of the UIs, if you are not prejudiced against 
UIs :).

Also, removing the node from cluster membership before jerking it out of 
the rack tells rgmanager to move any services off of it  - rather than 
having to depend on heartbeat failure to make this happen.

That said, if the blade catches fire and a cage IT guy notices and jerks 
it quick, (using his IT Oven Mitt, of course) it is silly for fenced to 
keep incessantly trying when the thing no longer even exists. Perhaps 
the correct solution would be to have the fence_bladecenter report 
success if the bladecenter admin unit reports that 'no status is 
available' for a particular blade - obviously if the thing is not there, 
it should be safe to say it is fenced :)

If this addresses your situation (I think it does), now would be a 
REALLY good time to file a ticket requesting
this behavior - like today! I'll post a fixed version to the ticket when 
it is ready.

Thanks to Lon for discussing this with me...;)

Regards,

-Jim

>
>My case below. Blade #3 is a good node. Blade #2 was removed. The fence
>does not work with the blade removed.
>
>system> env -T system:blade[3]
>OK
>system:blade[3]> power -state
>On
>system:blade[3]> env -T system:blade[2]
>The target bay is empty. 
>system:blade[3]> env -T system:blade[1]
>OK
>system:blade[1]>
>
>-----Original Message-----
>From: linux-cluster-bounces at redhat.com
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons
>Sent: Thursday, July 12, 2007 12:33 PM
>To: linux clustering
>Subject: Re: [Linux-cluster] Problem with fenced on cluster with 2
>BladeCentermachines: 1st machine is remove physically. The remaining one
>doesnot became Active (waiting for fenced)
>
>catalin.lupescu at bull.net wrote:
>
>  
>
>>Hello!
>>
>>I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
>>chassis.
>>(fenced version 1.32.6)
>>
>>I have done the following test:
>>I have removed physically the node 1 machine (the Active one).
>>The second one is never became active one. "Clustat" command does not 
>>printing any information.
>>In /var/log/messages we can found the following messages (repeated):
>>
>>Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
>>Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
>>reports: pattern match timed-out at /sbin/fence_bladecenter line 185 
>>Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
>>
>>If the node 1 is plugged, the node 2 became Active (fenced OK)
>>
>>    
>>
>bz#240509 changed the sleep timeout in the bladecenter agent from 5 to
>10...this is on or about line 193 in /sbin/fence_bladecenter.  See what
>yours is set at, and try pushing it out a bit. This minor change is
>making its way through the distribution chain now.
>
>-j
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>  
>




From catalin.lupescu at bull.net  Thu Jul 12 16:08:47 2007
From: catalin.lupescu at bull.net (catalin.lupescu at bull.net)
Date: Thu, 12 Jul 2007 18:08:47 +0200
Subject: =?iso-8859-1?Q?R=E9f=2E_=3A_Re=3A_[Linux-cluster]_Problem_with_fenced_on?=
	cluster with 2 BladeCenter	machines: 1st machine is remove physically.
	The remaining one does	not became Active (waiting for fenced)
Message-ID: <OFACE9D4F0.3A45CD6C-ONC1257316.00589FF7@frcl.bull.fr>

Hello!

I have change to 10 and also to 20. 
Nothing change (still the same pb)

Best regards!

Catalin LUPESCU
VoIP & Signalling Support

Bull, Architect of an Open World TM 
T?l : 01.30.80.75.43
http://www.bull.com 





James Parsons <jparsons at redhat.com>
Envoy? par : linux-cluster-bounces at redhat.com
12/07/2007 17:02
Veuillez r?pondre ? linux clustering

 
        Pour :  linux clustering <linux-cluster at redhat.com>
        cc : 
        Objet : Re: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter 
machines: 1st machine is remove physically. The remaining one does not 
became Active (waiting for fenced)

catalin.lupescu at bull.net wrote:

>
> Hello!
>
> I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
> chassis.
> (fenced version 1.32.6)
>
> I have done the following test:
> I have removed physically the node 1 machine (the Active one).
> The second one is never became active one. "Clustat" command does not 
> printing any information.
> In /var/log/messages we can found the following messages (repeated):
>
> Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
> reports: pattern match timed-out at /sbin/fence_bladecenter line 185
> Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
>
> If the node 1 is plugged, the node 2 became Active (fenced OK)
>
bz#240509 changed the sleep timeout in the bladecenter agent from 5 to 
10...this is on or about line 193 in /sbin/fence_bladecenter.  See what 
yours is set at, and try pushing it out a bit. This minor change is 
making its way through the distribution chain now.

-j

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070712/9a2a3c58/attachment.htm>

From dennis at demarco.com  Thu Jul 12 16:18:01 2007
From: dennis at demarco.com (dennis at demarco.com)
Date: Thu, 12 Jul 2007 12:18:01 -0400 (EDT)
Subject: [Linux-cluster] Re: NFSCookbook Redhat 5 Cluster
In-Reply-To: <1184250375.11507.271.camel@technetium.msp.redhat.com>
References: <Pine.LNX.4.64.0707111756490.20938@lycaeum.internal.lan>
	<46961925.1010201@redhat.com>
	<Pine.LNX.4.64.0707120849020.14863@lycaeum.internal.lan>
	<1184250375.11507.271.camel@technetium.msp.redhat.com>
Message-ID: <Pine.LNX.4.64.0707121158270.21561@lycaeum.internal.lan>


I was refering to the NFS cookbook being hard to follow in some areas. 
That could just be me being dense though ;)

One issue I had was needing to define a single gfs (clusterfs) as a resource, 
than having multiple ones.

Example: bobfs1, bobfs2, etc, gave me an error in 5.0. as the devices were 
not unique. (If I can remember the error correctly)

The directions show you to create manage resources for GFS (step 7b), 
then latern step 7c, duplicate your work? Step 7c needs to be tightened 
up. No mention of adding the IP resource.

(In the Manual config method, xml example)

In the  services area, you might want to put the IP address after the 
clusterfs reference. So the IP address is added last on service start.

I'm in RH436 class now, but next week I'll send you an e-mail with some 
ideas/changes, etc on my test setup. I'm using Xen on a desktop to study 
for the class and it's a major help.



Thanks,
Dennis

On Thu, 12 Jul 2007, Bob Peterson wrote:

> On Thu, 2007-07-12 at 08:50 -0400, dennis at demarco.com wrote:
>> It's listed on the cluster FAQ. It's last updated in March. It's a bit
>> hard to follow and not correct in some spots, but good starting point.
>>
>> - Dennis
>
>> http://sources.redhat.com/cluster/doc/nfscookbook.pdf - The Unofficial
>> NFS/GFS Cookbook.
>
> Hi Dennis,
>
> At the very bottom of the FAQ, there's a "Links" section that has does
> in fact have a link to the NFS/GFS cookbook.
>
> Are you talking about the FAQ or the Cookbook being hard to follow and
> not correct in some spots?  If there are problems, inaccuracies or points
> of confusion, I'd love to hear about them and I can change them in a
> heartbeat (time permitting, of course).  That's an advantage to them not
> being "Official Red Hat" documents.  I'd gladly spend time on fixing them
> rather than perpetuate bad information that frustrates everyone.  :)
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From Sthistle at gov.nl.ca  Thu Jul 12 16:24:29 2007
From: Sthistle at gov.nl.ca (Thistle, Scott)
Date: Thu, 12 Jul 2007 13:54:29 -0230
Subject: [Linux-cluster] Problem with fenced on cluster
	with2	BladeCentermachines:1st machine is remove physically.
	The	remaining one doesnot becameActive (waiting for fenced)
In-Reply-To: <46964C23.8050809@redhat.com>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>	<46964289.3040104@redhat.com><506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>
	<46964C23.8050809@redhat.com>
Message-ID: <506B469CC6211B49BE28F6AC56BBCA35013BF822@STJH-P102.PSNL.CA>

How do I submit a ticket? Call RH support? 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 1:14 PM
To: linux clustering
Subject: Re: [Linux-cluster] Problem with fenced on cluster with2
BladeCentermachines:1st machine is remove physically. The remaining one
doesnot becameActive (waiting for fenced)

Thistle, Scott wrote:

>I am having the same issue. If a blade is not present (i.e. removed for

>maintenance), the fence_bladecenter cannot check the state as it is 
>reported empty. I think it is something simple to fix for those versed 
>in perl. Normally the fence only runs against a blade that is present.
>If the blade is removed while running, you run into this issue.
>
I believe this is what you want to happen...if state cannot be checked,
fenced keeps trying. How could you determine it was safe to stop without
persisting some value like the number of fence tries, and trying to
reason out whether it was safe to stop? This will not happen if you
remove the blade from the cluster before physically removing it. It is a
snap to do this  with one of the UIs, if you are not prejudiced against
UIs :).

Also, removing the node from cluster membership before jerking it out of
the rack tells rgmanager to move any services off of it  - rather than
having to depend on heartbeat failure to make this happen.

That said, if the blade catches fire and a cage IT guy notices and jerks
it quick, (using his IT Oven Mitt, of course) it is silly for fenced to
keep incessantly trying when the thing no longer even exists. Perhaps
the correct solution would be to have the fence_bladecenter report
success if the bladecenter admin unit reports that 'no status is
available' for a particular blade - obviously if the thing is not there,
it should be safe to say it is fenced :)

If this addresses your situation (I think it does), now would be a
REALLY good time to file a ticket requesting this behavior - like today!
I'll post a fixed version to the ticket when it is ready.

Thanks to Lon for discussing this with me...;)

Regards,

-Jim

>
>My case below. Blade #3 is a good node. Blade #2 was removed. The fence

>does not work with the blade removed.
>
>system> env -T system:blade[3]
>OK
>system:blade[3]> power -state
>On
>system:blade[3]> env -T system:blade[2] The target bay is empty.
>system:blade[3]> env -T system:blade[1] OK system:blade[1]>
>
>-----Original Message-----
>From: linux-cluster-bounces at redhat.com
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons
>Sent: Thursday, July 12, 2007 12:33 PM
>To: linux clustering
>Subject: Re: [Linux-cluster] Problem with fenced on cluster with 2
>BladeCentermachines: 1st machine is remove physically. The remaining 
>one doesnot became Active (waiting for fenced)
>
>catalin.lupescu at bull.net wrote:
>
>  
>
>>Hello!
>>
>>I have a Cluster Redhat made with 2 nodes IBM blades on Blade Center 
>>chassis.
>>(fenced version 1.32.6)
>>
>>I have done the following test:
>>I have removed physically the node 1 machine (the Active one).
>>The second one is never became active one. "Clustat" command does not 
>>printing any information.
>>In /var/log/messages we can found the following messages (repeated):
>>
>>Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
>>Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
>>reports: pattern match timed-out at /sbin/fence_bladecenter line 185 
>>Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
>>
>>If the node 1 is plugged, the node 2 became Active (fenced OK)
>>
>>    
>>
>bz#240509 changed the sleep timeout in the bladecenter agent from 5 to 
>10...this is on or about line 193 in /sbin/fence_bladecenter.  See what

>yours is set at, and try pushing it out a bit. This minor change is 
>making its way through the distribution chain now.
>
>-j
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>  
>


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Thu Jul 12 16:49:01 2007
From: jparsons at redhat.com (James Parsons)
Date: Thu, 12 Jul 2007 12:49:01 -0400
Subject: [Linux-cluster] Problem with fenced on
	cluster	with2	BladeCentermachines:1st
	machine is remove physically.	The	remaining one doesnot becameActive
	(waiting for fenced)
In-Reply-To: <506B469CC6211B49BE28F6AC56BBCA35013BF822@STJH-P102.PSNL.CA>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>	<46964289.3040104@redhat.com><506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>	<46964C23.8050809@redhat.com>
	<506B469CC6211B49BE28F6AC56BBCA35013BF822@STJH-P102.PSNL.CA>
Message-ID: <46965B7D.3010809@redhat.com>

Thistle, Scott wrote:

>How do I submit a ticket? Call RH support? 
>  
>
bugzilla.redhat.com

file it against the cluster suite product / fence component for rhel4
file it against rh enterprise linux / fence component for rhel5

Thanks and regards,

-J



From chawkins at veracitynetworks.com  Thu Jul 12 17:19:49 2007
From: chawkins at veracitynetworks.com (Christopher Hawkins)
Date: Thu, 12 Jul 2007 13:19:49 -0400
Subject: [Linux-cluster] LVS cluster
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7AAB@exchsrv07.rootdom.dk>
Message-ID: <200707121701.l6CH1SPH011111@mail2.ontariocreditcorp.com>

LVS takes care of persistence and load balancing. Keepalived takes care of
informing LVS as to which servers are available to handle incoming
requests... 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Kristoffer Lippert
> Sent: Thursday, July 12, 2007 4:31 AM
> To: linux clustering
> Subject: SV: [Linux-cluster] LVS cluster
> 
> That sounds interesting.
> 
> So you have a std. ha cluster running the keepalived as a 
> failover cluster service, and the webservers running outside 
> the cluster (on the same machines). Then the keepalived does 
> the loadbalancing. Is that correctly understood?
> 
> Will this take care of session persistence?
> 
> 
> Kind regards
> Kristoffer
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Marc Grimme
> Sendt: 11. juli 2007 18:40
> Til: linux clustering
> Emne: Re: [Linux-cluster] LVS cluster
> 
> On Wednesday 11 July 2007 16:41:32 Kristoffer Lippert wrote:
> > One fairly simple question for a change. :-)
> >
> > We have two frontend webservers running a cluster to handle 
> a SAN (gfs).
> > Would it be possible to run the LVS router on the same physical 
> > machines as the cluster? Thus creating a fully redundant 
> loadbalanced 
> > cluster of our two frontend webservers.
> >
> > Currently we run with dns RoundRobin "load balancing". 
> (Wich is not so 
> > good in failover situations).
> >
> > Thanks in advance :-)
> > Mvh / Kind regards
> >
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> >
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
> 
> we've done this with keepalive (also) and some rgmanager 
> skripts. It runs on the cluster as ha-service and makes loadbalancing.
> 
> Regards Marc.
> 
> --
> Gruss / Regards,
> 
> Marc Grimme
> Phone: +49-89 452 3538-14
> http://www.atix.de/               http://www.open-sharedroot.org/
> 
> **
> ATIX - Ges. fuer Informationstechnologie und Consulting mbH 
> Einsteinstr. 10 - 85716 Unterschleissheim - Germany
> 
> Registergericht: Amtsgericht M?nchen
> Registernummer: HRB 131682
> USt.-Id.: DE209485962
> 
> Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 




From Sthistle at gov.nl.ca  Thu Jul 12 17:37:58 2007
From: Sthistle at gov.nl.ca (Thistle, Scott)
Date: Thu, 12 Jul 2007 15:07:58 -0230
Subject: [Linux-cluster] Problem with fenced
	oncluster	with2	BladeCentermachines:1stmachine is remove
	physically.	The	remaining one doesnot becameActive(waiting for fenced)
In-Reply-To: <46965B7D.3010809@redhat.com>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>	<46964289.3040104@redhat.com><506B469CC6211B49BE28F6AC56BBCA35013BF81F@STJH-P102.PSNL.CA>	<46964C23.8050809@redhat.com><506B469CC6211B49BE28F6AC56BBCA35013BF822@STJH-P102.PSNL.CA>
	<46965B7D.3010809@redhat.com>
Message-ID: <506B469CC6211B49BE28F6AC56BBCA35013BF824@STJH-P102.PSNL.CA>

Submitted..

Bugzilla Bug 248006: fence_bladecenter fails when blade is not
physically present
 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 2:19 PM
To: linux clustering
Subject: Re: [Linux-cluster] Problem with fenced oncluster with2
BladeCentermachines:1stmachine is remove physically. The remaining one
doesnot becameActive(waiting for fenced)

Thistle, Scott wrote:

>How do I submit a ticket? Call RH support? 
>  
>
bugzilla.redhat.com

file it against the cluster suite product / fence component for rhel4
file it against rh enterprise linux / fence component for rhel5

Thanks and regards,

-J

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Thu Jul 12 18:34:58 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 12 Jul 2007 14:34:58 -0400
Subject: AW: [Linux-cluster] Redhat cluster and GFS node limit
In-Reply-To: <994144.14043.qm@web32201.mail.mud.yahoo.com>
References: <56825E6A1E167E41A80CE0F6F6916CAB0266A7A3@MNTSVCL10E.mmgmuc.de>
	<994144.14043.qm@web32201.mail.mud.yahoo.com>
Message-ID: <20070712183458.GY18076@redhat.com>

On Wed, Jul 11, 2007 at 02:36:45PM -0700, Hal wrote:
> Thanks for the quick answers! But the answers raised few more questions.
> Yes I need LVS to balance ~100 nodes which access the same data on GFS.

> In this case what modules do I need from the RedHat Cluster? 
> Do I need locks (DLM)? I think I do, don't I?

On RHEL4 and the like:

dlm, ccs, gfs, cman, cman-kernel, gfs-kernel, fence... I think that's it

On RHEL5 and the like:

openais, cman (includes ccs, cman, fence), dlm, kmod-gfs...

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jul 12 18:48:10 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 12 Jul 2007 14:48:10 -0400
Subject: [Linux-cluster] details of fencing
In-Reply-To: <1184227426.24028.49.camel@com30.ser.aps.anl.gov>
References: <1184227426.24028.49.camel@com30.ser.aps.anl.gov>
Message-ID: <20070712184809.GZ18076@redhat.com>

On Thu, Jul 12, 2007 at 03:03:46AM -0500, James Fait wrote:
> I am in the process of implementing clustering for shared data storage
> across a number of nodes, with several nodes exporting large GNBD
> volumes, and also new storage from an iSCSI raid chassis with 6TB of
> storage.  The nature of the application requires that the nodes that
> access the data store are pretty much independent of each other, just
> providing CPU and graphics support while reading several hundred
> megabytes of image data in 32mb chunks, and writing numerous small
> summary files of this data.  Our current methodology, which works but is
> slow, is to server the data by NFS over gigabit ethernet. A similar
> facility nearby, with the same application, has implemented GFS on FC
> equipment, and are using the FC switch for fencing.  As I have somewhat
> different storage hardware and data retention requirements, I need to
> implement different fencing methods.
> 
> The storage network is on a 3com switch, which is able to take down a
> given link via a telnet command, and later restore it.

We don't have an agent for this; you'll have to assemble one.

> Also, each of
> the storage nodes has a Smart UPS with control over the individual
> outlets on the UPS, which could be used for power fencing of the GNBD
> server nodes.  The only issue there is that these are not networked UPS
> systems, but are connected via serial ports to some of the nodes.  On
> the network switch fencing, I am currently using the storage net for
> cluster communications, so bringing down a port also stops cluster
> communications.

Loss of cluster comms is generally what will require I/O fencing in the
first place.  That is, a node being fenced and losing cluster comms is
quite fine.

Note, however, that without power fencing, you can't just turn ports
back on - you need to go in and reboot the nodes that are fenced and
manually re-enable them.

> I know I will probably have to write a fence agent for at least some of
> the parts of this.  The questions that I have are the exact sequence of
> events for fencing a node, as in who initiates the fencing operation,

One node handles fencing if I recall correctly.  I don't remember how
the node is chosen; someone else will have to answer that.


> and what is the sequence of events for recovery and rejoining the
> cluster after a reboot. 

(a) node is fenced
(b) administrator turns node off
(c) administrator unfences node
(d) administrator turns node on

... should work :)

> I currently have a test setup of four nodes
> with a 4TB GNBD export from one of the nodes to the other three, using
> fence_gnbd on those nodes, and fence_gnbd_notify with fence_manual on
> the server, at least until I can get the UPS fence agent working. 

> If I
> need to, I can put the UPS systems on a network terminal server to allow
> any node to connect to the UPS for commands, but would prefer that it
> connect to one of the cluster nodes directly using the serial port. 

The terminal server idea will work better - and be less complex in the
end.  It won't matter which node fences which other node - and in the
same realm of logic, it won't matter which nodes are online at the time
fencing must be done for a particular node.

> For
> the iSCSI chassis, from the manual it appears that I can force a iSCSI
> disconnect via snmp or telnet using the management interface for the
> chassis, which from my reading of the RFQ, should be an effective fence
> for iSCSI, as it will invalidate the current connection from the
> initiator, and requires a re-authentication and negotiation of the link
> before allowing more communications with that node.

Yes, and the node will need to be rebooted.


> Hopefully, this gives enough information to a least get a start on this,
> as it is several issues, each which may need separate followup.

If all of the nodes have power fencing, you can avoid a lot of hassle.
Try to take the simplest path.  Complexity is inversely proportional to
reliability.

How many nodes have UPS access?


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jul 12 18:51:12 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 12 Jul 2007 14:51:12 -0400
Subject: [Linux-cluster] fence problem on rhel5
In-Reply-To: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>
References: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>
Message-ID: <20070712185112.GA18076@redhat.com>

On Thu, Jul 12, 2007 at 01:50:34PM +0200, Matteo Catanese wrote:
> I dont have any shared storage, I just rsync by hand any change from  
> master machine to slave machine.

> I can use fence_manual with a script that always returns succees but  
> i dont think this is a clean solution

Make the cman init script not start fenced if you have no shared storage
(not even NFS)?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Thu Jul 12 18:57:57 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 12 Jul 2007 13:57:57 -0500
Subject: [Linux-cluster] fence problem on rhel5
In-Reply-To: <20070712185112.GA18076@redhat.com>
References: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>
	<20070712185112.GA18076@redhat.com>
Message-ID: <469679B5.7010406@cmiware.com>

What are the implications of this?  i.e. , if there is node failure and 
fenced is not running, does the cluster skip fencing and assume services?

Lon Hohberger wrote:
> On Thu, Jul 12, 2007 at 01:50:34PM +0200, Matteo Catanese wrote:
>   
>> I dont have any shared storage, I just rsync by hand any change from  
>> master machine to slave machine.
>>     
>
>   
>> I can use fence_manual with a script that always returns succees but  
>> i dont think this is a clean solution
>>     
>
> Make the cman init script not start fenced if you have no shared storage
> (not even NFS)?
>
>   



From lhh at redhat.com  Thu Jul 12 21:13:41 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 12 Jul 2007 17:13:41 -0400
Subject: [Linux-cluster] fence problem on rhel5
In-Reply-To: <469679B5.7010406@cmiware.com>
References: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>
	<20070712185112.GA18076@redhat.com> <469679B5.7010406@cmiware.com>
Message-ID: <20070712211341.GD18076@redhat.com>

On Thu, Jul 12, 2007 at 01:57:57PM -0500, Chris Harms wrote:
> What are the implications of this?  i.e. , if there is node failure and 
> fenced is not running, does the cluster skip fencing and assume services?

That's what's supposed to happen.

Obviously if you have shared storage, you can expect data corruption too
^^;;

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Thu Jul 12 21:42:45 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 12 Jul 2007 16:42:45 -0500
Subject: [Linux-cluster] fence problem on rhel5
In-Reply-To: <20070712211341.GD18076@redhat.com>
References: <12DB0E3B-296B-4773-AAF7-FCC40235534F@kinetikon.com>	<20070712185112.GA18076@redhat.com>
	<469679B5.7010406@cmiware.com> <20070712211341.GD18076@redhat.com>
Message-ID: <4696A055.2040201@cmiware.com>

I do not have cluster storage, e.g. SAN, GFS, NFS.  Using 2 DRBD 
patitions for postgres so I will evaluate this option carefully.  If it 
can take care of its own split brain recovery and avoid FS corruption, I 
may chance the lack of fencing as a nonfunctional cluster is pretty bad.


Lon Hohberger wrote:
> On Thu, Jul 12, 2007 at 01:57:57PM -0500, Chris Harms wrote:
>   
>> What are the implications of this?  i.e. , if there is node failure and 
>> fenced is not running, does the cluster skip fencing and assume services?
>>     
>
> That's what's supposed to happen.
>
> Obviously if you have shared storage, you can expect data corruption too
> ^^;;
>
>   



From m.catanese at kinetikon.com  Fri Jul 13 10:19:04 2007
From: m.catanese at kinetikon.com (Matteo Catanese)
Date: Fri, 13 Jul 2007 12:19:04 +0200
Subject: [Linux-cluster] fence problem on rhel5 and another question
Message-ID: <C537B996-78A6-443E-A7A8-10FE68F6687D@kinetikon.com>

 >Make the cman init script not start fenced if you have no shared  
storage

I don't think modifying a red hat vanilla startup script it's a good  
idea.
I will loose all changes in case of a cman package update
Anyway i will try the attached  little patch.

While making this patch i was looking at /etc/init.d/cman script and  
i found this:

if fence_xvmd_enabled() then start fence_xvmd

Can you modify cman script to use something similar for normal fence ?

if fence_enable() then start fence; else don't stress :-)

 >(not even NFS)?

No nfs, nothing, i just need a virtual ip addres.


Now the other question:

Granted that  I dont want to install  non-certified, non-rhn- 
updateable software, the question is: is there something similar to  
DRBD in rhel5 ?

This can be very useful when  you need a little shared partition  
without struggling with an expensive shared storage

Regards

Matteo Catanese

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nofence.diff
Type: application/octet-stream
Size: 2156 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070713/95d10681/attachment.obj>
-------------- next part --------------


From tomas.hoger at gmail.com  Fri Jul 13 11:46:06 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Fri, 13 Jul 2007 13:46:06 +0200
Subject: [Linux-cluster] LVS cluster
In-Reply-To: <200707111543.l6BFhBPH029186@mail2.ontariocreditcorp.com>
References: <6cfbd1b40707110812y1c3465d8m2472dc5c062c8dba@mail.gmail.com>
	<200707111543.l6BFhBPH029186@mail2.ontariocreditcorp.com>
Message-ID: <6cfbd1b40707130446q75c873a4kea6e5f96c855404d@mail.gmail.com>

On 7/11/07, Christopher Hawkins <chawkins at veracitynetworks.com> wrote:
> I'd suggest looking into keepalived as well. It (or something similar) will
> perform service checks and will add / remove nodes from the LVS tables on
> the fly if they go online / offline. That should fix your failover issue.

Piranha/pulse, keepalived, ldirectord, ipvsman - all are different
tools with similar purpose - monitor services on real servers and
update LVS configuration as needed.

th.



From rpeterso at redhat.com  Fri Jul 13 13:35:08 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 13 Jul 2007 08:35:08 -0500
Subject: [Linux-cluster] a couple of questions regarding clusters
In-Reply-To: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz>
References: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz>
Message-ID: <1184333708.517.26.camel@technetium.msp.redhat.com>

On Sat, 2007-06-23 at 17:23 -0500, Brent Sachnoff wrote:
> I have a 3 node cluster running redhat 4 with gfs.  What is the proper
> way to have a node leave the cluster for maintenance and then rejoin
> after maintenance is completed?  From the docs, I have read that I
> need to unmount gfs and then stop all the services in the following
> order: rgmanager, gfs, clvmd, fenced.  I can then issue a cman_tool
> leave (remove) request.
> 
>  
> 
> I have also noticed that if I lose ip connectivity to a certain node I
> lose gfs connectivity with the other two nodes.  I thought that I
> would only need 2 votes to continue connectivity. 
> 
>  
> 
> Thanks for the help!

Hi Brent,

I've been up to my eyeballs in alligators recently, and somehow your
email seems to have slipped my attention.  Sorry about that.
Usually I'm better at responding to emails (usually within one day).
I don't know if anyone has answered your question yet (it's always
better to ask on the linux-cluster public mailing list where you
can get input from hundreds of people rather than just me) but I'll
try.

Your summary of how to remove a node for maintenance sounds right.
That should be it.

If you unmount the gfs file system from one of the nodes,
the other nodes should still have normal access to it.
If a node leaves the cluster under abnormal conditions, like 
it dies or has the power switch turned off, the other nodes will
freeze their activity on the gfs file system until they know the
node has been fenced properly.  That usually means an "ACK" from
the fence agent is passed to the other nodes.  If you're using
manual fencing, they'll hang forever, waiting for the ack command.
That's done on purpose to ensure the file system's integrity.

I've not seen a problem with that area.  If you're having a problem
with it, you can open up a bugzilla record against the appropriate
version of software and we'll dig into it, but you'll probably
need to provide instructions on how to recreate it, since we
haven't seen the problem here.

Regards,

Bob Peterson
Red Hat Cluster Suite





From joseparrella at gmail.com  Fri Jul 13 16:18:36 2007
From: joseparrella at gmail.com (=?UTF-8?B?Sm9zw6kgTWlndWVsIFBhcnJlbGxhIFJvbWVybw==?=)
Date: Fri, 13 Jul 2007 12:18:36 -0400
Subject: [Linux-cluster] Two-node clusters using GFS and shared storage
Message-ID: <4697A5DC.5010705@gmail.com>

Greetings,

I've been trying to setup a two-node cluster using a shared SAN (via
Fibre Channel) and GFS. I've previously tried OCFS2, and I don't want to
use NFS yet. The cluster must be an active-active one, and it runs on
Itanium2 machines with Debian 4.0. I'm using cman 1.03.00

I've setup a cluster using Red Hat tools, and my
/etc/cluster/cluster.conf looks like:

-- my cluster.conf --

<?xml version="1.0"?>
<cluster name="correo" config_version="1">

<cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="node1" votes="1">
</clusternode>

<clusternode name="node2" votes="1">
</clusternode>

</clusternodes>

</cluster>

-- end my cluster.conf --

Note that I've removed entries related to fencing, but I previously had
a 'manual' fencing method. So I've an LVM volume which contains a GFS
filesystem, and I'm able to start ccsd, cman, fenced, clvmd and all the
other related applications.

Syslog reports that the cluster is quorate, and I'm able to mount the
filesystem in both of my nodes. They need to write to the shared storage
in an active-active fashion.

I expect that removing the network cable in node1 would do the following:

a) node1 would be disabled (right, it doesn't have a network cable)
b) node2 would notice node1 is not there and will keep writing to the
shared storage
c) Eventually node1 will come back, and node2 will notice it, so it will
hopefully start writing again

And this it what happens when I unplug the network cable:

a) node1 is disabled (no connectivity)
b) node2 is also disabled! (trying to write to /home and /var/mail
stalls the machine, and then logins and other processes are stalled)
c) Plugging the cable back does nothing (both machines are hanged now,
so I need to reboot them)

I'm probably missing something, since this solution using OCFS2 also has
the same problem! Our last-resort solution is active-active NFS using
Heartbeat, but then we wouldn't be writing to the SAN through FC (2Gbps)
but through Ethernet (1Gbps) since we don't have any other media around ATM.

Is this a configuration related problem? Or is this a design feature in
 both GFS/OCFS2? Or maybe I'm just missing the whole picture...

Thank you very much for any advice,
Jose



From Robert.Gil at americanhm.com  Fri Jul 13 16:28:43 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 13 Jul 2007 12:28:43 -0400
Subject: [Linux-cluster] Two-node clusters using GFS and shared storage
In-Reply-To: <4697A5DC.5010705@gmail.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA8AF@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

Redhat recommends that you use hardware based fencing. Manual fencing causes a lot of issues. Its fine in dev but not for production. You manually need to fence the device and remove it from the cluster when you remove the network cable. If you have some kind of power fencing available you should use that. It will solve those problems. 


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jos? Miguel Parrella Romero
Sent: Friday, July 13, 2007 12:19 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Two-node clusters using GFS and shared storage

Greetings,

I've been trying to setup a two-node cluster using a shared SAN (via Fibre Channel) and GFS. I've previously tried OCFS2, and I don't want to use NFS yet. The cluster must be an active-active one, and it runs on
Itanium2 machines with Debian 4.0. I'm using cman 1.03.00

I've setup a cluster using Red Hat tools, and my /etc/cluster/cluster.conf looks like:

-- my cluster.conf --

<?xml version="1.0"?>
<cluster name="correo" config_version="1">

<cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="node1" votes="1">
</clusternode>

<clusternode name="node2" votes="1">
</clusternode>

</clusternodes>

</cluster>

-- end my cluster.conf --

Note that I've removed entries related to fencing, but I previously had a 'manual' fencing method. So I've an LVM volume which contains a GFS filesystem, and I'm able to start ccsd, cman, fenced, clvmd and all the other related applications.

Syslog reports that the cluster is quorate, and I'm able to mount the filesystem in both of my nodes. They need to write to the shared storage in an active-active fashion.

I expect that removing the network cable in node1 would do the following:

a) node1 would be disabled (right, it doesn't have a network cable)
b) node2 would notice node1 is not there and will keep writing to the shared storage
c) Eventually node1 will come back, and node2 will notice it, so it will hopefully start writing again

And this it what happens when I unplug the network cable:

a) node1 is disabled (no connectivity)
b) node2 is also disabled! (trying to write to /home and /var/mail stalls the machine, and then logins and other processes are stalled)
c) Plugging the cable back does nothing (both machines are hanged now, so I need to reboot them)

I'm probably missing something, since this solution using OCFS2 also has the same problem! Our last-resort solution is active-active NFS using Heartbeat, but then we wouldn't be writing to the SAN through FC (2Gbps) but through Ethernet (1Gbps) since we don't have any other media around ATM.

Is this a configuration related problem? Or is this a design feature in  both GFS/OCFS2? Or maybe I'm just missing the whole picture...

Thank you very much for any advice,
Jose

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From J.vandenHorn at xb.nl  Fri Jul 13 16:32:22 2007
From: J.vandenHorn at xb.nl (Jeroen van den Horn)
Date: Fri, 13 Jul 2007 18:32:22 +0200
Subject: [Linux-cluster] Two-node clusters using GFS and shared storage
In-Reply-To: <4697A5DC.5010705@gmail.com>
References: <4697A5DC.5010705@gmail.com>
Message-ID: <4697A916.1030401@xb.nl>

Jos?,

Fencing is not optional but mandatory for GFS. Once the failing node is 
detected the cluster nodes will *wait* until the failing node is 
successfully fenced. Once fenced (i.e. power-cycled or disconnected from 
the SAN) one of the cluster nodes will replay the journal of the failing 
node and GFS operation continues. Without fencing the cluster will hang 
on any lock that is obtained by the failing node (like your hanging 
systems).

Install a proper fencing agent for operational use. For testing purposes 
you could use manual fencing (i.e. run fence_manual).

PS: Plugging the cable back in without power-cycling is a NO-GO. The 
failing node is no longer in-sync with the rest of the cluster (they 
assume the machine has been power-cycled after a manual fence) - you 
will risk GFS filesystem corruption by attaching it back to the storage 
without proper fencing procedures!

Jeroen

Jos? Miguel Parrella Romero wrote:
> Greetings,
>
> I've been trying to setup a two-node cluster using a shared SAN (via
> Fibre Channel) and GFS. I've previously tried OCFS2, and I don't want to
> use NFS yet. The cluster must be an active-active one, and it runs on
> Itanium2 machines with Debian 4.0. I'm using cman 1.03.00
>
> I've setup a cluster using Red Hat tools, and my
> /etc/cluster/cluster.conf looks like:
>
> -- my cluster.conf --
>
> <?xml version="1.0"?>
> <cluster name="correo" config_version="1">
>
> <cman two_node="1" expected_votes="1">
> </cman>
>
> <clusternodes>
>
> <clusternode name="node1" votes="1">
> </clusternode>
>
> <clusternode name="node2" votes="1">
> </clusternode>
>
> </clusternodes>
>
> </cluster>
>
> -- end my cluster.conf --
>
> Note that I've removed entries related to fencing, but I previously had
> a 'manual' fencing method. So I've an LVM volume which contains a GFS
> filesystem, and I'm able to start ccsd, cman, fenced, clvmd and all the
> other related applications.
>
> Syslog reports that the cluster is quorate, and I'm able to mount the
> filesystem in both of my nodes. They need to write to the shared storage
> in an active-active fashion.
>
> I expect that removing the network cable in node1 would do the following:
>
> a) node1 would be disabled (right, it doesn't have a network cable)
> b) node2 would notice node1 is not there and will keep writing to the
> shared storage
> c) Eventually node1 will come back, and node2 will notice it, so it will
> hopefully start writing again
>
> And this it what happens when I unplug the network cable:
>
> a) node1 is disabled (no connectivity)
> b) node2 is also disabled! (trying to write to /home and /var/mail
> stalls the machine, and then logins and other processes are stalled)
> c) Plugging the cable back does nothing (both machines are hanged now,
> so I need to reboot them)
>
> I'm probably missing something, since this solution using OCFS2 also has
> the same problem! Our last-resort solution is active-active NFS using
> Heartbeat, but then we wouldn't be writing to the SAN through FC (2Gbps)
> but through Ethernet (1Gbps) since we don't have any other media around ATM.
>
> Is this a configuration related problem? Or is this a design feature in
>  both GFS/OCFS2? Or maybe I'm just missing the whole picture...
>
> Thank you very much for any advice,
> Jose
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From bfilipek at crscold.com  Fri Jul 13 17:21:30 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Fri, 13 Jul 2007 12:21:30 -0500
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>

I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
anyone have a RHEL5 cluster running with a similar fiber channel setup?
I am looking for the proper drivers for these HBA's, but Dell, EMC, and
QLogic are telling me that this configuration is not yet supported so
there are no drivers yet. I would also like to install SANSurfer if
possible. 

 

Thanks for any help. 

 

 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070713/38c545b1/attachment.htm>

From fait at anl.gov  Sat Jul 14 18:14:15 2007
From: fait at anl.gov (James Fait)
Date: Sat, 14 Jul 2007 13:14:15 -0500
Subject: [Linux-cluster] GNBD/GFS question
Message-ID: <1184436855.29495.28.camel@com30.ser.aps.anl.gov>

I have run into the following problem using GNBD/GFS on a 4 node test
cluster I have set up.  I have been testing failure scenarios to make
sure that things will be handled correctly when a node is fenced using
fence_gnbd.
The cluster is a set of 4 dell 1u servers, with dual 1GB network
connections, one to the intranet, and the other to a storage net.  The
cluster communications all happen on the storage net.  Each machine has
a dns name, which is decoded to the hostname for logs, which is the
intranet address. The storage net also has name to ip resolution via the
hosts file, to names of the form cluster*.  Cluster29 is the gnbd server
node, and does not mount the gfs volume at all.  The other three nodes
do mount the gfs volume.

This is the test that is currently causing the problem.
        I kill the network link for cluster28(com28) using the network
        switch to temporarily disable the link.
        
        Cluster28 is fenced successfully. One of the other nodes handles
        the gfs cleanup.
        
        The network link is restored.
        
        Cluster28 attempts an orderly shutdown, and hangs on umount of
        gfs.  Later, it is power cycled to force reboot.
        
        Cluster28 rejoins the cluster, and attempts to mount gfs.  The
        gnbd_import command fails with the error messages:
                gnbd_recvd: ERROR login refused by the server,
                quitting : Operation not permitted
                gnbd_import: ERROR gnbd_recvd failed
        
        Cluster29 reports the error:
                com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
                cluster28 is banned. Canceling login

What is happening, and what can I do about it?

Sincerly

Jim
-- 
James Fait, Ph.D.
Beamline Scientist, SER-CAT
APS, building 436B-008
Argonne National Laboratory
9700 S Cass Ave
Argonne, IL 60439
phone 630-252-0644
fax   630-252-0652
email fait at anl.gov



From bsd_daemon at msn.com  Sat Jul 14 18:36:32 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Sat, 14 Jul 2007 18:36:32 +0000
Subject: [Linux-cluster] GNBD/GFS question
In-Reply-To: <1184436855.29495.28.camel@com30.ser.aps.anl.gov>
Message-ID: <BAY105-F382E693C654C0F5713150BE3FE0@phx.gbl>


hi james, are you using which the "gnbd_export" parameters ?? if you don't 
use "-c", you get this error (Cancelled login). Because the device has 
timeout. you check fence with "gnbd_import -c GNBD_SERVER_NAME" on other 
nodes...

you should use "gnbd_import -s XXXX -t YYYY" on GNBD_SERVER. XXXX is 
node_name or node_IP. YYYY is gnbd_server_name or gnbd_server_IP.

have a nice day..

Mehmet CELIK
Istanbul/TURKEY

>From: James Fait <fait at anl.gov>
>Reply-To: linux.clustering at redhat.com,linux clustering 
><linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: [Linux-cluster] GNBD/GFS question
>Date: Sat, 14 Jul 2007 13:14:15 -0500
>
>I have run into the following problem using GNBD/GFS on a 4 node test
>cluster I have set up.  I have been testing failure scenarios to make
>sure that things will be handled correctly when a node is fenced using
>fence_gnbd.
>The cluster is a set of 4 dell 1u servers, with dual 1GB network
>connections, one to the intranet, and the other to a storage net.  The
>cluster communications all happen on the storage net.  Each machine has
>a dns name, which is decoded to the hostname for logs, which is the
>intranet address. The storage net also has name to ip resolution via the
>hosts file, to names of the form cluster*.  Cluster29 is the gnbd server
>node, and does not mount the gfs volume at all.  The other three nodes
>do mount the gfs volume.
>
>This is the test that is currently causing the problem.
>         I kill the network link for cluster28(com28) using the network
>         switch to temporarily disable the link.
>
>         Cluster28 is fenced successfully. One of the other nodes handles
>         the gfs cleanup.
>
>         The network link is restored.
>
>         Cluster28 attempts an orderly shutdown, and hangs on umount of
>         gfs.  Later, it is power cycled to force reboot.
>
>         Cluster28 rejoins the cluster, and attempts to mount gfs.  The
>         gnbd_import command fails with the error messages:
>                 gnbd_recvd: ERROR login refused by the server,
>                 quitting : Operation not permitted
>                 gnbd_import: ERROR gnbd_recvd failed
>
>         Cluster29 reports the error:
>                 com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
>                 cluster28 is banned. Canceling login
>
>What is happening, and what can I do about it?
>
>Sincerly
>
>Jim
>--
>James Fait, Ph.D.
>Beamline Scientist, SER-CAT
>APS, building 436B-008
>Argonne National Laboratory
>9700 S Cass Ave
>Argonne, IL 60439
>phone 630-252-0644
>fax   630-252-0652
>email fait at anl.gov
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507



From pillai at mathstat.dal.ca  Sat Jul 14 18:45:45 2007
From: pillai at mathstat.dal.ca (Balagopal Pillai)
Date: Sat, 14 Jul 2007 15:45:45 -0300 (ADT)
Subject: [Linux-cluster] GNBD/GFS question
In-Reply-To: <BAY105-F382E693C654C0F5713150BE3FE0@phx.gbl>
References: <BAY105-F382E693C654C0F5713150BE3FE0@phx.gbl>
Message-ID: <Pine.LNX.4.64.0707141541380.6955@chase.mathstat.dal.ca>

On Sat, 14 Jul 2007, mehmet celik wrote:
Hi,

        I have seen this before when i was evaluating gfs, in the scenario 
when a fenced node, when rebooted and is back up, still stays fenced. Please try manual unfencing  
like this. This worked for me -
gnbd_import -u <node name> -t <gnbd server>

Regards
Balagopal

> 
> hi james, are you using which the "gnbd_export" parameters ?? if you don't use
> "-c", you get this error (Cancelled login). Because the device has timeout.
> you check fence with "gnbd_import -c GNBD_SERVER_NAME" on other nodes...
> 
> you should use "gnbd_import -s XXXX -t YYYY" on GNBD_SERVER. XXXX is node_name
> or node_IP. YYYY is gnbd_server_name or gnbd_server_IP.
> 
> have a nice day..
> 
> Mehmet CELIK
> Istanbul/TURKEY
> 
> >From: James Fait <fait at anl.gov>
> >Reply-To: linux.clustering at redhat.com,linux clustering
> ><linux-cluster at redhat.com>
> >To: linux clustering <linux-cluster at redhat.com>
> >Subject: [Linux-cluster] GNBD/GFS question
> >Date: Sat, 14 Jul 2007 13:14:15 -0500
> >
> >I have run into the following problem using GNBD/GFS on a 4 node test
> >cluster I have set up.  I have been testing failure scenarios to make
> >sure that things will be handled correctly when a node is fenced using
> >fence_gnbd.
> >The cluster is a set of 4 dell 1u servers, with dual 1GB network
> >connections, one to the intranet, and the other to a storage net.  The
> >cluster communications all happen on the storage net.  Each machine has
> >a dns name, which is decoded to the hostname for logs, which is the
> >intranet address. The storage net also has name to ip resolution via the
> >hosts file, to names of the form cluster*.  Cluster29 is the gnbd server
> >node, and does not mount the gfs volume at all.  The other three nodes
> >do mount the gfs volume.
> >
> >This is the test that is currently causing the problem.
> >         I kill the network link for cluster28(com28) using the network
> >         switch to temporarily disable the link.
> >
> >         Cluster28 is fenced successfully. One of the other nodes handles
> >         the gfs cleanup.
> >
> >         The network link is restored.
> >
> >         Cluster28 attempts an orderly shutdown, and hangs on umount of
> >         gfs.  Later, it is power cycled to force reboot.
> >
> >         Cluster28 rejoins the cluster, and attempts to mount gfs.  The
> >         gnbd_import command fails with the error messages:
> >                 gnbd_recvd: ERROR login refused by the server,
> >                 quitting : Operation not permitted
> >                 gnbd_import: ERROR gnbd_recvd failed
> >
> >         Cluster29 reports the error:
> >                 com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
> >                 cluster28 is banned. Canceling login
> >
> >What is happening, and what can I do about it?
> >
> >Sincerly
> >
> >Jim
> >--
> >James Fait, Ph.D.
> >Beamline Scientist, SER-CAT
> >APS, building 436B-008
> >Argonne National Laboratory
> >9700 S Cass Ave
> >Argonne, IL 60439
> >phone 630-252-0644
> >fax   630-252-0652
> >email fait at anl.gov
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> _________________________________________________________________
> http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 



From bsd_daemon at msn.com  Sat Jul 14 18:47:56 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Sat, 14 Jul 2007 18:47:56 +0000
Subject: [Linux-cluster] GNBD/GFS question
In-Reply-To: <1184436855.29495.28.camel@com30.ser.aps.anl.gov>
Message-ID: <BAY105-F3294CE1F1D14CD0A21A13AE3FE0@phx.gbl>


hi james, i am sorry...

>>>  you should use "gnbd_import -s XXXX -t YYYY" on GNBD_SERVER. XXXX is 
>>>node_name or node_IP. YYYY is gnbd_server_name or gnbd_server_IP.

may be "gnbd_import -u XXXX -t YYYY". right now, I dont remember.

additionally to this, see man page of "gnbd_import".

Mehmet CELIK
Istanbul/TURKEY

>From: James Fait <fait at anl.gov>
>Reply-To: linux.clustering at redhat.com,linux clustering 
><linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: [Linux-cluster] GNBD/GFS question
>Date: Sat, 14 Jul 2007 13:14:15 -0500
>
>I have run into the following problem using GNBD/GFS on a 4 node test
>cluster I have set up.  I have been testing failure scenarios to make
>sure that things will be handled correctly when a node is fenced using
>fence_gnbd.
>The cluster is a set of 4 dell 1u servers, with dual 1GB network
>connections, one to the intranet, and the other to a storage net.  The
>cluster communications all happen on the storage net.  Each machine has
>a dns name, which is decoded to the hostname for logs, which is the
>intranet address. The storage net also has name to ip resolution via the
>hosts file, to names of the form cluster*.  Cluster29 is the gnbd server
>node, and does not mount the gfs volume at all.  The other three nodes
>do mount the gfs volume.
>
>This is the test that is currently causing the problem.
>         I kill the network link for cluster28(com28) using the network
>         switch to temporarily disable the link.
>
>         Cluster28 is fenced successfully. One of the other nodes handles
>         the gfs cleanup.
>
>         The network link is restored.
>
>         Cluster28 attempts an orderly shutdown, and hangs on umount of
>         gfs.  Later, it is power cycled to force reboot.
>
>         Cluster28 rejoins the cluster, and attempts to mount gfs.  The
>         gnbd_import command fails with the error messages:
>                 gnbd_recvd: ERROR login refused by the server,
>                 quitting : Operation not permitted
>                 gnbd_import: ERROR gnbd_recvd failed
>
>         Cluster29 reports the error:
>                 com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
>                 cluster28 is banned. Canceling login
>
>What is happening, and what can I do about it?
>
>Sincerly
>
>Jim
>--
>James Fait, Ph.D.
>Beamline Scientist, SER-CAT
>APS, building 436B-008
>Argonne National Laboratory
>9700 S Cass Ave
>Argonne, IL 60439
>phone 630-252-0644
>fax   630-252-0652
>email fait at anl.gov
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_2G_0507



From bsd_daemon at msn.com  Sat Jul 14 18:50:04 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Sat, 14 Jul 2007 18:50:04 +0000
Subject: [Linux-cluster] GNBD/GFS question
In-Reply-To: <Pine.LNX.4.64.0707141541380.6955@chase.mathstat.dal.ca>
Message-ID: <BAY105-F97AC6B718C1F6AB77FF9AE3FE0@phx.gbl>


hi Balagopal, thanx for information..

have a nice day..

>From: Balagopal Pillai <pillai at mathstat.dal.ca>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: RE: [Linux-cluster] GNBD/GFS question
>Date: Sat, 14 Jul 2007 15:45:45 -0300 (ADT)
>
>On Sat, 14 Jul 2007, mehmet celik wrote:
>Hi,
>
>         I have seen this before when i was evaluating gfs, in the scenario
>when a fenced node, when rebooted and is back up, still stays fenced. 
>Please try manual unfencing
>like this. This worked for me -
>gnbd_import -u <node name> -t <gnbd server>
>
>Regards
>Balagopal
>
> >
> > hi james, are you using which the "gnbd_export" parameters ?? if you 
>don't use
> > "-c", you get this error (Cancelled login). Because the device has 
>timeout.
> > you check fence with "gnbd_import -c GNBD_SERVER_NAME" on other nodes...
> >
> > you should use "gnbd_import -s XXXX -t YYYY" on GNBD_SERVER. XXXX is 
>node_name
> > or node_IP. YYYY is gnbd_server_name or gnbd_server_IP.
> >
> > have a nice day..
> >
> > Mehmet CELIK
> > Istanbul/TURKEY
> >
> > >From: James Fait <fait at anl.gov>
> > >Reply-To: linux.clustering at redhat.com,linux clustering
> > ><linux-cluster at redhat.com>
> > >To: linux clustering <linux-cluster at redhat.com>
> > >Subject: [Linux-cluster] GNBD/GFS question
> > >Date: Sat, 14 Jul 2007 13:14:15 -0500
> > >
> > >I have run into the following problem using GNBD/GFS on a 4 node test
> > >cluster I have set up.  I have been testing failure scenarios to make
> > >sure that things will be handled correctly when a node is fenced using
> > >fence_gnbd.
> > >The cluster is a set of 4 dell 1u servers, with dual 1GB network
> > >connections, one to the intranet, and the other to a storage net.  The
> > >cluster communications all happen on the storage net.  Each machine has
> > >a dns name, which is decoded to the hostname for logs, which is the
> > >intranet address. The storage net also has name to ip resolution via 
>the
> > >hosts file, to names of the form cluster*.  Cluster29 is the gnbd 
>server
> > >node, and does not mount the gfs volume at all.  The other three nodes
> > >do mount the gfs volume.
> > >
> > >This is the test that is currently causing the problem.
> > >         I kill the network link for cluster28(com28) using the network
> > >         switch to temporarily disable the link.
> > >
> > >         Cluster28 is fenced successfully. One of the other nodes 
>handles
> > >         the gfs cleanup.
> > >
> > >         The network link is restored.
> > >
> > >         Cluster28 attempts an orderly shutdown, and hangs on umount of
> > >         gfs.  Later, it is power cycled to force reboot.
> > >
> > >         Cluster28 rejoins the cluster, and attempts to mount gfs.  The
> > >         gnbd_import command fails with the error messages:
> > >                 gnbd_recvd: ERROR login refused by the server,
> > >                 quitting : Operation not permitted
> > >                 gnbd_import: ERROR gnbd_recvd failed
> > >
> > >         Cluster29 reports the error:
> > >                 com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
> > >                 cluster28 is banned. Canceling login
> > >
> > >What is happening, and what can I do about it?
> > >
> > >Sincerly
> > >
> > >Jim
> > >--
> > >James Fait, Ph.D.
> > >Beamline Scientist, SER-CAT
> > >APS, building 436B-008
> > >Argonne National Laboratory
> > >9700 S Cass Ave
> > >Argonne, IL 60439
> > >phone 630-252-0644
> > >fax   630-252-0652
> > >email fait at anl.gov
> > >
> > >--
> > >Linux-cluster mailing list
> > >Linux-cluster at redhat.com
> > >https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > _________________________________________________________________
> > 
>http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
http://im.live.com/messenger/im/home/?source=hmtextlinkjuly07



From maciej.bogucki at artegence.com  Sat Jul 14 19:52:58 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Sat, 14 Jul 2007 21:52:58 +0200
Subject: [Linux-cluster] Two-node clusters using GFS and shared storage
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA8AF@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA8AF@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <4699299A.9040805@artegence.com>

Robert Gil napisa?(a):
> Redhat recommends that you use hardware based fencing. 

If You don't want(or can) to use hardware based fencinf, You could try
to use new fence_scsi agent. If You are using a shared SAN (via Fibre
Channel) then You need to check if Your SAN supports Persistent
Reservations (SPC-2 or greater).

Best Regarda
Maciej Bogucki



From maciej.bogucki at artegence.com  Sat Jul 14 20:06:03 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Sat, 14 Jul 2007 22:06:03 +0200
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
Message-ID: <46992CAB.5040509@artegence.com>

Brad Filipek napisa?(a):
> I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
> of my servers (Dell PowerEdge 1950?s) have a QLogic QLE2462 HBA. Does
> anyone have a RHEL5 cluster running with a similar fiber channel setup?
> I am looking for the proper drivers for these HBA?s, but Dell, EMC, and
> QLogic are telling me that this configuration is not yet supported so
> there are no drivers yet. I would also like to install SANSurfer if
> possible.
> 

Hello,

Qlogix QLE2462 is supported by Linux [1], and here is Linux driver [2].
If standard RHEL5 doesn't have driver for FC card You could crate it on
another machine and add it(via floppy) at installation time.

[1] - http://www.cybernetech.co.jp/pdf/qle2462.pdf
[2] -
http://support.qlogic.com/support/oem_detail_hds.asp?oemid=84&classid=237

Best Regards
Maciej Bogucki



From bsd_daemon at msn.com  Sun Jul 15 10:58:15 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Sun, 15 Jul 2007 10:58:15 +0000
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
Message-ID: <BAY105-F31D4EFBD214846DB45BB8CE3FF0@phx.gbl>


hi brad, you should look the RHEL Advanced Platform.

Have a nice day..

Mehmet CELIK
Istanbul/TURKEY

>From: "Brad Filipek" <bfilipek at crscold.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
>Date: Fri, 13 Jul 2007 12:21:30 -0500
>
>I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
>of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
>anyone have a RHEL5 cluster running with a similar fiber channel setup?
>I am looking for the proper drivers for these HBA's, but Dell, EMC, and
>QLogic are telling me that this configuration is not yet supported so
>there are no drivers yet. I would also like to install SANSurfer if
>possible.
>
>
>
>Thanks for any help.
>
>
>
>
>
>Brad Filipek
>
>
>
>
>
>
>Confidentiality Notice: This message is intended for the use of the 
>individual or entity to which it is addressed and may contain information 
>that is privileged, confidential and exempt from disclosure under 
>applicable law. If the reader of this message is not the intended recipient 
>or the employee or agent responsible for delivering this message to the 
>intended recipient, you are hereby notified that any dissemination, 
>distribution or copying of this communication is strictly prohibited.
>
>If you have received this communication in error, please notify us 
>immediately by email reply or by telephone and immediately delete this 
>message and any attachments.
>


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Need a brain boost? Recharge with a stimulating game. Play now!? 
http://club.live.com/home.aspx?icid=club_hotmailtextlink1



From grimme at atix.de  Sun Jul 15 13:21:48 2007
From: grimme at atix.de (Marc Grimme)
Date: Sun, 15 Jul 2007 15:21:48 +0200
Subject: [Linux-cluster] gfs_scand
Message-ID: <200707151521.48246.grimme@atix.de>

Hello,
does anybody now what exactly is the task of gfs_scand. We see it with very 
much CPU time loads of times (eg. system is up for 40h and gfs_scand has 4h 
CPU-Time).
And can you track down which scand is responsible for what filesystem?
BTW: I'm talking about RHEL4U4.
-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From wcheng at redhat.com  Sun Jul 15 14:44:07 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Sun, 15 Jul 2007 10:44:07 -0400
Subject: [Linux-cluster] gfs_scand
In-Reply-To: <200707151521.48246.grimme@atix.de>
References: <200707151521.48246.grimme@atix.de>
Message-ID: <469A32B7.5090402@redhat.com>

Marc Grimme wrote:

>Hello,
>does anybody now what exactly is the task of gfs_scand. We see it with very 
>much CPU time loads of times (eg. system is up for 40h and gfs_scand has 4h 
>CPU-Time).
>And can you track down which scand is responsible for what filesystem?
>BTW: I'm talking about RHEL4U4.
>  
>

This is a complicated subject. So please bear with me and see whether 
the following description helps:

Gfs_scand scans GFS locks (glock) hash table to find:
1. if glock can be downgraded into less restricted state (say from 
shared state to unlock state) (and dirty data flushing is embedded in 
the glock transition code).
2. if glock is idle and in unlock state for too long, it will be reclaimed.

Whenever GFS needs a lock, it creates a glock and subsequently asks lock 
manager for a corresponding lock. In DLM case, there is one-to-one 
correspondence between glock and dlm lock.

Now if gfs_scand has used too much CPU time, it may mean the system has 
accumulated too many locks as described in:
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4

Unfortunately the lock trimming patch added into RHEL 4.5 is too "mild" 
(i.e. not aggressive enough, see Red Hat bugzilla 245776). We'll try to 
correct the issue as soon as next errata is available. In short, if the 
daemon has hogged too much CPU time without any sign of slowing down 
whenever it wakes up, you can try to make it run less often by:
shell> gfs_tool settune <mount_point> scand_secs <x> 
          // the default x is 5 seconds

The side effect of longer scand_secs is that if you have large amount of 
file write and/or delete activities, the dirty data will stay in the 
buffer cache for longer time and lock count will up considerably.

-- Wendy







From bernard.chew at muvee.com  Sun Jul 15 16:09:07 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Mon, 16 Jul 2007 00:09:07 +0800
Subject: [Linux-cluster] CLVM <SOLVED>
In-Reply-To: <229C73600EB0E54DA818AB599482BCE901816870@shadowfax.sg.muvee.net>
References: <229C73600EB0E54DA818AB599482BCE90181667F@shadowfax.sg.muvee.net><F154BEC9D4278D4BA9B94BDEC76193482BA818@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<229C73600EB0E54DA818AB599482BCE901816870@shadowfax.sg.muvee.net>
Message-ID: <229C73600EB0E54DA818AB599482BCE901816F47@shadowfax.sg.muvee.net>

Hi,

The problem has been solved. There was nothing wrong with the setup; it
was the router which was giving the problems (the internal router runs
RHEL4 and forwarding was not properly configured). I found out about the
problem when I decide to start cman with the external-facing network
interface (the necessary changes were made to /etc/hosts & cluster.conf)
instead of the private one. 

To specify the hostname cman service uses: cman_tool join -n
"hostname.domain.com"

Hope the information helps anyone who encounters the same problem.

Regards,
Bernard Chew

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Monday, July 09, 2007 11:54 PM
To: linux clustering
Subject: RE: [Linux-cluster] CLVM

Hi,

I should probably provide the info from "cman_tool nodes" as well (in
addition to the info below) so that the info is clearer...

[root at server2 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    3   M   server3 <- first node which started CLVMD
   3    1    3   M   server2

Regards,
Bernard

-------------------------

-----Original Message-----
From: Bernard Chew 
Sent: Monday, July 09, 2007 10:46 PM
To: 'linux clustering'
Subject: RE: [Linux-cluster] CLVM

Hi,

Thanks for the quick reply. Here is the info that I gathered;

--------------------------------------------
Node 1 (which started CLVMD successfully):

[root at server3]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,3,0
clvmd move use event 3
clvmd recover event 3 (first)
clvmd add nodes
clvmd total nodes 1
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 3 done
clvmd move flags 0,0,1 ids 0,3,3
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 3 finished
clvmd move flags 1,0,0 ids 3,3,3
clvmd move flags 0,1,0 ids 3,4,3
clvmd move use event 4
clvmd recover event 4
clvmd add node 3

[root at server3 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 update
U-4,1,3
[1 3]

--------------------------------------------
Node 2 (which just wait forever):

[root at server2 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes

[root at server2 ~]# cman_tool services
Service          Name                              GID LID State
Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 join
S-6,20,2
[1 3]
--------------------------------------------

Thanks for any help,
Bernard

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Monday, July 09, 2007 8:29 PM
To: linux clustering
Subject: RE: [Linux-cluster] CLVM

What error do you get? 


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Saturday, July 07, 2007 3:57 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] CLVM

Hi,

I have problems starting clvmd in a second node (after starting it
successfully on the first node) of a newly created 4-nodes cluster; no
problems staring the service first time on any nodes. Running "cman_tool
services" will show that the second node which started clvmd is in a
"join" status while the first node show a "update" status. This remains
even after a long period of time.

Given that the directories (i.e. /var /usr / ) are created using the
default lvm manager during installation, and I install the
lvm2-cluster-2.02.06-7.0.RHEL4.x86_64.rpm subsequently as part of the
requirements to set up GFS, will this cause the clvmd not to start
properly? I have no problems starting ccsd, cman and fenced.

Thanks in advance,
Bernard Chew
IT Operations Engineer


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From hal_bg at yahoo.com  Sun Jul 15 22:12:52 2007
From: hal_bg at yahoo.com (Hal)
Date: Sun, 15 Jul 2007 15:12:52 -0700 (PDT)
Subject: [Linux-cluster] CLVM, GFS and GNBD scenarios?
In-Reply-To: <229C73600EB0E54DA818AB599482BCE901816F47@shadowfax.sg.muvee.net>
Message-ID: <862208.89379.qm@web32201.mail.mud.yahoo.com>

Hi everybody,
I am interested in building relatively large inexpensive cluster but the
storage turned to be a problem. I need both high performance and high
availability. So I am thinking of some combination of GLVM, GFS and GNBD
without using expensive storage devices and fence devices. I want to use
several (~10-15) nodes that export separate local disks using GNBD and and to
use CLVM on ~100 importing nodes to make one redundant logical volume on all of
them with GFS on it. My questions are:
1. Is it possible at all?
2. Can CLVM make redundant logical volumes from imported GNBDs?
3. Is it possible to boost the performance by using some sort 
   of device mirroring or something like this in the described above scenario?
4. Does GNBD fence require any additional hardware to make fencing possible?
5. Will it be possible to use the described above FS if one of the exporting 
   nodes fails without fencing it? Will it be possible to restore the data on 
   it as on a RAID device?

Thanks in advance! 

Hal


       
____________________________________________________________________________________
Boardwalk for $500? In 2007? Ha! Play Monopoly Here and Now (it's updated for today's economy) at Yahoo! Games.
http://get.games.yahoo.com/proddesc?gamekey=monopolyherenow  



From zak at yellowfiber.net  Mon Jul 16 04:12:32 2007
From: zak at yellowfiber.net (Zak Thompson)
Date: Mon, 16 Jul 2007 00:12:32 -0400
Subject: [Linux-cluster] crash during gfs_grow
Message-ID: <A5E89E21CE21AF4186E3AC8E368B2A16040617@usvarstyfnexch1.yellowfiber.net>

Hello all, last night we were growing the gfs system and during the
growth we had a panic and had to bounce the machines.  Now everything is
back up and the lvm is showing the correct disk size  however when we
run a df it is still showing the same disk space/usage/  well this isn't
good we now lost 1tb of new disk space.  Any suggestions on how to get
this back?

 

Cheers,

Zak

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070716/b3298dd7/attachment.htm>

From Alain.Moulle at bull.net  Mon Jul 16 13:03:24 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Mon, 16 Jul 2007 15:03:24 +0200
Subject: [Linux-cluster] CS4/CS5 ///// RHEL4/RHEL5 
Message-ID: <469B6C9C.20006@bull.net>

Hi

I would like to know :

Is there a CS5 planned for RHEL5 ?

If so :
1/Is it mandatory on CS5, or is it possible to compile
a CS4 U5 to work on RHEL5 ?

2/ when CS5 will  be available ?

3/ Which are or will be the main enhancements of CS5 versus CS4 U5 ?

Thanks a lot
Alain Moull?

-- 




From sebastia at l00-bugdead-prods.de  Mon Jul 16 13:22:28 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 16 Jul 2007 15:22:28 +0200
Subject: [Linux-cluster] AIS Executive couldn't open configuration object
	database component
Message-ID: <20070716132229.2E07247C77@l00-bugdead-prods.de>

Hi,

I wanted to try out gfs2 instead of ocfs2 because of the acl's that it 
supports, but I do not get it running. I am on openSUSE 10.2, x86_64.

my kernel is 
Linux srv4 2.6.20.15-default #1 SMP Fri Jul 13 12:44:51 CEST 2007 x86_64 
x86_64 x86_64 GNU/Linux

I have openais rpm's created from source rpm's:
openais-devel-0.80.1-6
openais-0.80.1-6

and I use cluster-2.00.00, just compiled from source. My *.lcrso files are 
in /usr/lib64/lcrso/ and /usr/libexec/lcrso/. The LD_LIBRARY_PATH and the 
PATH variable have both pathes included.

when I start /etc/init.d/cman then I see the following:
/etc/init.d/cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
/usr/sbin/cman_tool: aisexec daemon didn't start
/etc/init.d/cman: line 413: failure: command not found

and this is /var/log/messages:

Jul 16 15:10:00 srv4 ccsd[23855]: Starting ccsd 2.00.00:
Jul 16 15:10:00 srv4 ccsd[23855]:  Built: Jul 13 2007 13:24:27
Jul 16 15:10:00 srv4 ccsd[23855]:  Copyright (C) Red Hat, Inc.  2004  All 
rights reserved.
Jul 16 15:10:00 srv4 ccsd[23855]: cluster.conf (cluster name = correo, 
version = 1) found.
Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive Service RELEASE 'subrev 
1204 version 0.80.1'
Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2002-2006 MontaVista 
Software, Inc and contributors.
Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive couldn't open 
configuration object database component.
Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive exiting (-13).


When I run cman_tool -w join manually, the same happens, when I run it 
within strace, I see it trying sometimes to connect to the 
socket /var/run/cman_admin, and then it fails, because the socket is not 
there.

I can start aisexec, but it does not create sockets /var/run/cman_client 
and /var/run/cman_admin. 









below my cluster.conf file:


<?xml version="1.0"?>
<cluster name="correo" config_version="1">
  <cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="srv4" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv4"/>
                </method>
        </fence>
</clusternode>

<clusternode name="srv5" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv5"/>
                </method>
        </fence>
</clusternode>

</clusternodes>

<fencedevices>
        <fencedevice name="ilo_srv4" agent="fence_ilo" 
ipaddr="192.168.9.180" login="ilo" />
        <fencedevice name="ilo_srv5" agent="fence_ilo" 
ipaddr="192.168.9.181" login="ilo" />
</fencedevices>

</cluster>


and my openais.conf file:

# Please read the openais.conf.5 manual page

totem {
        version: 2
        secauth: off
        interface {
                bindnetaddr: 192.168.8.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        debug: on
        timestamp: on
}

amf {
        mode: disabled
}


any hint what my problem is, and how to get the cman running?


kind regards
Sebastian



From pcaulfie at redhat.com  Mon Jul 16 13:32:25 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 16 Jul 2007 14:32:25 +0100
Subject: [Linux-cluster] AIS Executive couldn't open configuration object
	database component
In-Reply-To: <20070716132229.2E07247C77@l00-bugdead-prods.de>
References: <20070716132229.2E07247C77@l00-bugdead-prods.de>
Message-ID: <469B7369.90703@redhat.com>

Sebastian Reitenbach wrote:
> Hi,
> 
> I wanted to try out gfs2 instead of ocfs2 because of the acl's that it 
> supports, but I do not get it running. I am on openSUSE 10.2, x86_64.
> 
> my kernel is 
> Linux srv4 2.6.20.15-default #1 SMP Fri Jul 13 12:44:51 CEST 2007 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> I have openais rpm's created from source rpm's:
> openais-devel-0.80.1-6
> openais-0.80.1-6
> 
> and I use cluster-2.00.00, just compiled from source. My *.lcrso files are 
> in /usr/lib64/lcrso/ and /usr/libexec/lcrso/. The LD_LIBRARY_PATH and the 
> PATH variable have both pathes included.
> 
> when I start /etc/init.d/cman then I see the following:
> /etc/init.d/cman start
> Starting cluster:
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... failed
> /usr/sbin/cman_tool: aisexec daemon didn't start
> /etc/init.d/cman: line 413: failure: command not found
> 
> and this is /var/log/messages:
> 
> Jul 16 15:10:00 srv4 ccsd[23855]: Starting ccsd 2.00.00:
> Jul 16 15:10:00 srv4 ccsd[23855]:  Built: Jul 13 2007 13:24:27
> Jul 16 15:10:00 srv4 ccsd[23855]:  Copyright (C) Red Hat, Inc.  2004  All 
> rights reserved.
> Jul 16 15:10:00 srv4 ccsd[23855]: cluster.conf (cluster name = correo, 
> version = 1) found.
> Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive Service RELEASE 'subrev 
> 1204 version 0.80.1'
> Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2002-2006 MontaVista 
> Software, Inc and contributors.
> Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
> Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive couldn't open 
> configuration object database component.
> Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive exiting (-13).


My guess is that the cman lcrso file is missing from /usr/libexec/lcrso

-- 
Patrick



From jos at xos.nl  Mon Jul 16 13:42:16 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 16 Jul 2007 15:42:16 +0200
Subject: [Linux-cluster] CS4/CS5 ///// RHEL4/RHEL5
In-Reply-To: <469B6C9C.20006@bull.net>;
	from Alain.Moulle@bull.net on Mon, Jul 16, 2007 at 03:03:24PM
	+0200
References: <469B6C9C.20006@bull.net>
Message-ID: <20070716154216.B28047@xos037.xos.nl>

On Mon, Jul 16, 2007 at 03:03:24PM +0200, Alain Moulle wrote:

> Is there a CS5 planned for RHEL5 ?

There is a cluster suite included with RHEL5 since its launch.

> 1/Is it mandatory on CS5, or is it possible to compile
> a CS4 U5 to work on RHEL5 ?

I don't think the latter is possible, because the kernel modules that
are different between RHEL4 and RHEL5 (and some are gone).

> 2/ when CS5 will  be available ?

Since March 14...?

> 3/ Which are or will be the main enhancements of CS5 versus CS4 U5 ?

I leave this to others to summarize.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From sebastia at l00-bugdead-prods.de  Mon Jul 16 13:59:06 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 16 Jul 2007 15:59:06 +0200
Subject: [Linux-cluster] AIS Executive couldn't open configuration
	objectdatabase component
Message-ID: <20070716135906.618F347931@l00-bugdead-prods.de>

Hi,

> > 
> > Jul 16 15:10:00 srv4 ccsd[23855]: Starting ccsd 2.00.00:
> > Jul 16 15:10:00 srv4 ccsd[23855]:  Built: Jul 13 2007 13:24:27
> > Jul 16 15:10:00 srv4 ccsd[23855]:  Copyright (C) Red Hat, Inc.  2004  
All 
> > rights reserved.
> > Jul 16 15:10:00 srv4 ccsd[23855]: cluster.conf (cluster name = correo, 
> > version = 1) found.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive Service 
RELEASE 'subrev 
> > 1204 version 0.80.1'
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2002-2006 MontaVista 
> > Software, Inc and contributors.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive couldn't open 
> > configuration object database component.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive exiting (-13).
> 
> 
> My guess is that the cman lcrso file is missing from /usr/libexec/lcrso

no, while trying  to get it running, I already copied the file 
service_cman.lcrso, from /usr/libexec/lcrso to /usr/lib64/lcrso. 

Out of curiosity I copied all files from /usr/lib64/lcrso 
to /usr/libexec/lcrso, and I get past that error, but now I get a new one, 
that I need to investigate further. 

thanks a lot
Sebastian





From bfilipek at crscold.com  Mon Jul 16 14:15:58 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Mon, 16 Jul 2007 09:15:58 -0500
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
References: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
	<46992CAB.5040509@artegence.com>
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CE2F9@SRVEDI.upark.crscold.com>

Maciej,

Your first link does not show RHEL5 as being supported, only RHEL3/4. 

Your second link is for Hitachi Data Systems OEM drivers. My SAN is an EMC unit. 

Thanks anyways. 


Brad Filipek

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maciej Bogucki
Sent: Saturday, July 14, 2007 3:06 PM
To: linux clustering
Subject: Re: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?

Brad Filipek napisa?(a):
> I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
> of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
> anyone have a RHEL5 cluster running with a similar fiber channel setup?
> I am looking for the proper drivers for these HBA's, but Dell, EMC, and
> QLogic are telling me that this configuration is not yet supported so
> there are no drivers yet. I would also like to install SANSurfer if
> possible.
> 

Hello,

Qlogix QLE2462 is supported by Linux [1], and here is Linux driver [2].
If standard RHEL5 doesn't have driver for FC card You could crate it on
another machine and add it(via floppy) at installation time.

[1] - http://www.cybernetech.co.jp/pdf/qle2462.pdf
[2] -
http://support.qlogic.com/support/oem_detail_hds.asp?oemid=84&classid=237

Best Regards
Maciej Bogucki

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.




From bfilipek at crscold.com  Mon Jul 16 14:17:07 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Mon, 16 Jul 2007 09:17:07 -0500
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
References: <BAY105-F31D4EFBD214846DB45BB8CE3FF0@phx.gbl>
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CE2FC@SRVEDI.upark.crscold.com>

Mehmet,

I am running RHEL5 AP. That is the OS I need drivers for. 

Brad Filipek


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of mehmet celik
Sent: Sunday, July 15, 2007 5:58 AM
To: linux-cluster at redhat.com
Subject: RE: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?


hi brad, you should look the RHEL Advanced Platform.

Have a nice day..

Mehmet CELIK
Istanbul/TURKEY

>From: "Brad Filipek" <bfilipek at crscold.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
>Date: Fri, 13 Jul 2007 12:21:30 -0500
>
>I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
>of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
>anyone have a RHEL5 cluster running with a similar fiber channel setup?
>I am looking for the proper drivers for these HBA's, but Dell, EMC, and
>QLogic are telling me that this configuration is not yet supported so
>there are no drivers yet. I would also like to install SANSurfer if
>possible.
>
>
>
>Thanks for any help.
>
>
>
>
>
>Brad Filipek
>
>
>
>
>
>
>Confidentiality Notice: This message is intended for the use of the 
>individual or entity to which it is addressed and may contain information 
>that is privileged, confidential and exempt from disclosure under 
>applicable law. If the reader of this message is not the intended recipient 
>or the employee or agent responsible for delivering this message to the 
>intended recipient, you are hereby notified that any dissemination, 
>distribution or copying of this communication is strictly prohibited.
>
>If you have received this communication in error, please notify us 
>immediately by email reply or by telephone and immediately delete this 
>message and any attachments.
>


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Need a brain boost? Recharge with a stimulating game. Play now!? 
http://club.live.com/home.aspx?icid=club_hotmailtextlink1

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.




From sebastia at l00-bugdead-prods.de  Mon Jul 16 14:37:12 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 16 Jul 2007 16:37:12 +0200
Subject: [Linux-cluster] fenced segfault
Message-ID: <20070716143713.2AB9F47C8F@l00-bugdead-prods.de>

Hi,

I am still on openSUSE 10,2, x86_64, using openais-0.80.1-6 (rpm from source 
rpm), and cluster-2.0.0. (self compiled). Kernel is Linux srv4 
2.6.20.15-default #1 SMP Fri Jul 13 12:44:51 CEST 2007 x86_64 x86_64 x86_64 
GNU/Linux

Now, when I run /etc/init.d/cman, for the first time, the fenced segfaults, 
and the cman init script hangs and is waiting for the fenced. When I Ctrl-C 
the init script, and kill the aisexec and the /sbin/ccsd, and then restart 
the init script, then, after some minutes, the fenced is also starting and 
the script ends with a "success".

the following are the logs while starting /etc/init.d/cman


Jul 16 16:19:49 srv4 ccsd[29691]: Starting ccsd 2.00.00:
Jul 16 16:19:49 srv4 ccsd[29691]:  Built: Jul 13 2007 13:24:27
Jul 16 16:19:49 srv4 ccsd[29691]:  Copyright (C) Red Hat, Inc.  2004  All 
rights reserved.
Jul 16 16:19:49 srv4 ccsd[29691]: cluster.conf (cluster name = correo, 
version = 1) found.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] AIS Executive Service 
RELEASE 'subrev 1204 version 0.80.1'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Copyright (C) 2002-2006 
MontaVista Software, Inc and contributors.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Copyright (C) 2006 Red Hat, 
Inc.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Using default multicast address 
of 239.192.25.250
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_cpg 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais cluster closed process group service v1.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_cfg 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais configuration service'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_msg 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais message service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_lck 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais distributed locking service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_evt 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais event service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_ckpt 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais checkpoint service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_amf 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais availability management framework B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_clm 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais cluster membership service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_evs 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais extended virtual synchrony service'
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] openais component openais_cman 
loaded.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] Registering service 
handler 'openais CMAN membership service 2.01'
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Token Timeout (10000 ms) 
retransmit timeout (495 ms)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] token hold (386 ms) retransmits 
before loss (20 retrans)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] join (60 ms) send_join (0 ms) 
consensus (4800 ms) merge (200 ms)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] downcheck (1000000 ms) fail to 
recv const (50 msgs)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] seqno unchanged const (30 
rotations) Maximum network MTU 1500
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] window size per rotation (50 
messages) maximum messages per rotation (17 messages)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] send threads (0 threads)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] RRP token expired timeout (495 
ms)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] RRP token problem counter (2000 
ms)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] RRP threshold (10 problem 
count)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] RRP mode set to none.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] heartbeat_failures_allowed (0)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] max_network_delay (50 ms)
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] HeartBeat is Disabled. To 
enable set heartbeat_failures_allowed > 0
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Receive multicast socket recv 
buffer size (262142 bytes).
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Transmit multicast socket send 
buffer size (262142 bytes).
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] The network interface 
[192.168.8.13] is now up.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Created or loaded sequence id 
68.192.168.8.13 for this ring.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] entering GATHER state from 15.
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais extended virtual synchrony service'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais cluster membership service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais availability management framework B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais checkpoint service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais event service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais distributed locking service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais message service B.01.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais configuration service'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais cluster closed process group service v1.01'
Jul 16 16:19:51 srv4 openais[29697]: [SERV ] Initialising service 
handler 'openais CMAN membership service 2.01'
Jul 16 16:19:51 srv4 openais[29697]: [CMAN ] CMAN 2.00.00 (built Jul 13 2007 
13:24:30) started
Jul 16 16:19:51 srv4 openais[29697]: [SYNC ] Not using a virtual synchrony 
filter.
Jul 16 16:19:51 srv4 openais[29697]: [MAIN ] AIS Executive Service: started 
and ready to provide service.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Creating commit token because I 
am the rep.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Saving state aru 0 high seq 
received 0
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] entering COMMIT state.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] entering RECOVERY state.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] position [0] member 
192.168.8.13:
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] previous ring seq 68 rep 
192.168.8.13
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] aru 0 high delivered 0 received 
flag 0
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Did not need to originate any 
messages in recovery.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Storing new sequence id for 
ring 48
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] Sending initial ORF token
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] New Configuration:
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] Members Left:
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] Members Joined:
Jul 16 16:19:51 srv4 openais[29697]: [SYNC ] This node is within the primary 
component and will provide service.
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] New Configuration:
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ]    r(0) ip(192.168.8.13)
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] Members Left:
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] Members Joined:
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ]    r(0) ip(192.168.8.13)
Jul 16 16:19:51 srv4 openais[29697]: [SYNC ] This node is within the primary 
component and will provide service.
Jul 16 16:19:51 srv4 openais[29697]: [TOTEM] entering OPERATIONAL state.
Jul 16 16:19:51 srv4 openais[29697]: [CMAN ] quorum regained, resuming 
activity
Jul 16 16:19:51 srv4 openais[29697]: [CLM  ] got nodejoin message 
192.168.8.13
Jul 16 16:19:51 srv4 ccsd[29691]: Initial status:: Quorate
Jul 16 16:19:59 srv4 fenced[29709]: srv5 not a cluster member after 6 sec 
post_join_delay
Jul 16 16:19:59 srv4 kernel: fenced[29709]: segfault at 0000000000000000 rip 
0000000000405e97 rsp 00007fff30ee0b80 error 4


below my cluster.conf file:

<?xml version="1.0"?>
<cluster name="correo" config_version="1">
  <cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="srv4" nodeid="1" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv4"/>
                </method>
        </fence>
</clusternode>

<clusternode name="srv5" nodeid="1" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv5"/>
                </method>
        </fence>
</clusternode>

</clusternodes>

<fencedevices>
        <fencedevice name="ilo_srv4" agent="fence_ilo" 
ipaddr="192.168.8.180" login="ilo" />
        <fencedevice name="ilo_srv5" agent="fence_ilo" 
ipaddr="192.168.8.181" login="ilo" />
</fencedevices>

</cluster>

any hint what could cause the segfault of the fenced? The ilo boards on the 
two servers are not yet configured, I don't know whether this could cause 
the problem?

kind regards
Sebastian



From andremachado at techforce.com.br  Mon Jul 16 14:52:39 2007
From: andremachado at techforce.com.br (andremachado)
Date: Mon, 16 Jul 2007 7:52:39 -0700
Subject: [Linux-cluster] CLVM, GFS and GNBD scenarios?
In-Reply-To: <862208.89379.qm@web32201.mail.mud.yahoo.com>
References: <862208.89379.qm@web32201.mail.mud.yahoo.com>
Message-ID: <2d51698c91e87270dd4cea9c46a3f17c@localhost>

Hello,
there are gnbd flaws (at specific versions, verify) and clvm2 limitations.
Read carefully [0].
It seems that recent cvs commits were addressing these problems, but i am not sure, nor tested yet. Verify.
CLVM2 currently does not have a working mirroring feature in cluster mode nor snapshoting.
RH does not recomend gnbd internal fencing in failover [1].
Regards.
Andre Felipe Machado

[0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch
[1] http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Global_Network_Block_Device/s1-gnbd-mp-sn.html





From rpeterso at redhat.com  Mon Jul 16 14:50:26 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 16 Jul 2007 09:50:26 -0500
Subject: [Linux-cluster] crash during gfs_grow
In-Reply-To: <A5E89E21CE21AF4186E3AC8E368B2A16040617@usvarstyfnexch1.yellowfiber.net>
References: <A5E89E21CE21AF4186E3AC8E368B2A16040617@usvarstyfnexch1.yellowfiber.net>
Message-ID: <1184597426.517.44.camel@technetium.msp.redhat.com>

On Mon, 2007-07-16 at 00:12 -0400, Zak Thompson wrote:
> Hello all, last night we were growing the gfs system and during the
> growth we had a panic and had to bounce the machines.  Now everything
> is back up and the lvm is showing the correct disk size  however when
> we run a df it is still showing the same disk space/usage/  well this
> isn?t good we now lost 1tb of new disk space.  Any suggestions on how
> to get this back?
> 
>  
> 
> Cheers,
> 
> Zak

Hi Zak,

What release of the software were you using when the panic occurred?
The reason I'm asking is because I'm trying to find out if you have
the relatively new fast df feature.  The fast df feature keeps track
of df changes differently than the older/slower df, which is more
accurate.  If you have this feature, you may want to temporarily
disable it and do a df to see if it shows better numbers.  It's fairly
new so at this point I'm going to assume you don't have that feature.

The problem of whether or not the newly added space is usable depends
on the state of the file system and whether the RGs were added
into the file system correctly, and that depends on how far it got
before the system panicked.

I recommend running gfs_fsck on the file system with a recent version
of gfs_fsck.  Recent versions of gfs_fsck are able to check the 
integrity of the RGs and possibly repair them if they're not in order.

If the RGs were not added in, you could probably just do another
gfs_grow to get them added in.

Regards,

Bob Peterson
Red Hat Cluster Suite




From wcheng at redhat.com  Mon Jul 16 15:05:48 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Mon, 16 Jul 2007 11:05:48 -0400
Subject: [Linux-cluster] crash during gfs_grow
In-Reply-To: <1184597426.517.44.camel@technetium.msp.redhat.com>
References: <A5E89E21CE21AF4186E3AC8E368B2A16040617@usvarstyfnexch1.yellowfiber.net>
	<1184597426.517.44.camel@technetium.msp.redhat.com>
Message-ID: <469B894C.6040406@redhat.com>

Bob Peterson wrote:
> What release of the software were you using when the panic occurred?
> The reason I'm asking is because I'm trying to find out if you have
> the relatively new fast df feature.  The fast df feature keeps track
> of df changes differently than the older/slower df, which is more
> accurate.  If you have this feature, you may want to temporarily
> disable it and do a df to see if it shows better numbers.  It's fairly
> new so at this point I'm going to assume you don't have that feature.
>   

The new faster "df" works the other way around. You have to explicitly 
turn it on (via "gfs_tool settune" command). The default is off (i.e. 
the old-slow df is the default).

-- Wendy



From berthiaume_wayne at emc.com  Mon Jul 16 16:07:09 2007
From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com)
Date: Mon, 16 Jul 2007 12:07:09 -0400
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CE2FC@SRVEDI.upark.crscold.com>
References: <BAY105-F31D4EFBD214846DB45BB8CE3FF0@phx.gbl>
	<9C01E18EF3BC2448A3B1A4812EB87D026CE2FC@SRVEDI.upark.crscold.com>
Message-ID: <D364D39DAD21D444BAE2C70919B6280906489A9D@CORPUSMX40A.corp.emc.com>

Hi Brad.

	You should use the QLogic driver, qla2xxx, that arrives with the RHEL 5.0 distribution. This will be the only driver EMC will be supporting. As for your configuration with RHEL 5.0 in an RHCS/GFS cluster, it is in qualification at EMC at this time. 

Regards,
Wayne.

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brad Filipek
Sent: Monday, July 16, 2007 10:17 AM
To: linux clustering
Subject: RE: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?

Mehmet,

I am running RHEL5 AP. That is the OS I need drivers for. 

Brad Filipek


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of mehmet celik
Sent: Sunday, July 15, 2007 5:58 AM
To: linux-cluster at redhat.com
Subject: RE: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?


hi brad, you should look the RHEL Advanced Platform.

Have a nice day..

Mehmet CELIK
Istanbul/TURKEY

>From: "Brad Filipek" <bfilipek at crscold.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
>Date: Fri, 13 Jul 2007 12:21:30 -0500
>
>I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
>of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
>anyone have a RHEL5 cluster running with a similar fiber channel setup?
>I am looking for the proper drivers for these HBA's, but Dell, EMC, and
>QLogic are telling me that this configuration is not yet supported so
>there are no drivers yet. I would also like to install SANSurfer if
>possible.
>
>
>
>Thanks for any help.
>
>
>
>
>
>Brad Filipek
>
>
>
>
>
>
>Confidentiality Notice: This message is intended for the use of the 
>individual or entity to which it is addressed and may contain information 
>that is privileged, confidential and exempt from disclosure under 
>applicable law. If the reader of this message is not the intended recipient 
>or the employee or agent responsible for delivering this message to the 
>intended recipient, you are hereby notified that any dissemination, 
>distribution or copying of this communication is strictly prohibited.
>
>If you have received this communication in error, please notify us 
>immediately by email reply or by telephone and immediately delete this 
>message and any attachments.
>


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Need a brain boost? Recharge with a stimulating game. Play now!? 
http://club.live.com/home.aspx?icid=club_hotmailtextlink1

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From sebastia at l00-bugdead-prods.de  Mon Jul 16 16:08:45 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 16 Jul 2007 18:08:45 +0200
Subject: [Linux-cluster] fenced segfault
Message-ID: <20070716160845.9434147D43@l00-bugdead-prods.de>

Hi,
> 
> <fencedevices>
>         <fencedevice name="ilo_srv4" agent="fence_ilo" 
> ipaddr="192.168.8.180" login="ilo" />
>         <fencedevice name="ilo_srv5" agent="fence_ilo" 
> ipaddr="192.168.8.181" login="ilo" />
> </fencedevices>
> 
> </cluster>
> 
> any hint what could cause the segfault of the fenced? The ilo boards on 
the 
> two servers are not yet configured, I don't know whether this could cause 
> the problem?
> 
I changed the fencedevices in cluster.conf to fence_manual, but with the 
same segmentation fault, so the not configured ilo boards doesnt seem to 
cause the problem.

<fencedevices>
        <fencedevice name="ilo_srv4" agent="fence_manual" nodename="srv4" />
        <fencedevice name="ilo_srv5" agent="fence_manual" nodename="srv5" />
</fencedevices>


> kind regards
> Sebastian




From lpleiman at redhat.com  Mon Jul 16 16:14:59 2007
From: lpleiman at redhat.com (Leo Pleiman)
Date: Mon, 16 Jul 2007 12:14:59 -0400
Subject: [Linux-cluster] Are RHEL5 cluster comms different than RHEL4?
Message-ID: <469B9983.10203@redhat.com>

Hello,

$subject?
Using an IBM bladecenter, we are able to build a RHEL4 cluster of which 
the nodes are in different chassis and different racks using a private 
vlan. When building a RHEL5 cluster, nodes which are NOT in the same 
chassis cannot join a cluster.

Example:

node1 is in chassis 1
node2 is in chassis 3
node3 is in chassis 3

All nodes are using bond0:0 as their public interface with FQDN in 
/etc/hosts.
All nodes are using bond0.999 as the private cluster interface with FQDN 
in hosts and cluster.conf.

Under RHEL4 all nodes join to form a cluster.
Under RHEL5 nodes 2 and 3 form a cluster but node1 never joins the 
cluster. Under RHEL5, when node1 tries to join the cluster there are NO 
messages in /var/log/messages that it is even trying to join the cluster.

Any ideas?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lpleiman.vcf
Type: text/x-vcard
Size: 194 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070716/facbf057/attachment.vcf>

From lhh at redhat.com  Mon Jul 16 17:06:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 16 Jul 2007 13:06:19 -0400
Subject: [Linux-cluster] fence problem on rhel5 and another question
In-Reply-To: <C537B996-78A6-443E-A7A8-10FE68F6687D@kinetikon.com>
References: <C537B996-78A6-443E-A7A8-10FE68F6687D@kinetikon.com>
Message-ID: <20070716170619.GB4423@redhat.com>

On Fri, Jul 13, 2007 at 12:19:04PM +0200, Matteo Catanese wrote:
> Now the other question:
> 
> Granted that  I dont want to install  non-certified, non-rhn- 
> updateable software, the question is: is there something similar to  
> DRBD in rhel5 ?

Red Hat does not ship DRBD.

In theory, GNBD + cluster mirroring should be able to do this, but I do
not know if anyone has tried it yet (?).

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 16 17:07:40 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 16 Jul 2007 13:07:40 -0400
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
Message-ID: <20070716170740.GC4423@redhat.com>

On Fri, Jul 13, 2007 at 12:21:30PM -0500, Brad Filipek wrote:
> I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
> of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
> anyone have a RHEL5 cluster running with a similar fiber channel setup?
> I am looking for the proper drivers for these HBA's, but Dell, EMC, and
> QLogic are telling me that this configuration is not yet supported so
> there are no drivers yet. I would also like to install SANSurfer if
> possible. 

I have an AX100 and it works fine; I can't speak from experience about
the AX150.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From jparsons at redhat.com  Mon Jul 16 17:12:13 2007
From: jparsons at redhat.com (James Parsons)
Date: Mon, 16 Jul 2007 13:12:13 -0400
Subject: [Linux-cluster] fenced segfault
In-Reply-To: <20070716160845.9434147D43@l00-bugdead-prods.de>
References: <20070716160845.9434147D43@l00-bugdead-prods.de>
Message-ID: <469BA6ED.5050602@redhat.com>

Sebastian Reitenbach wrote:
There is no attribute in the conf file named 'ipaddr' for fence_ilo. 
Change it to 'hostname'...you can still use a dotted quad as the 
arg...it's just that the name is not correct.

http://sources.redhat.com/cluster/doc/cluster_schema.html

-J

>Hi,
>  
>
>><fencedevices>
>>        <fencedevice name="ilo_srv4" agent="fence_ilo" 
>>ipaddr="192.168.8.180" login="ilo" />
>>        <fencedevice name="ilo_srv5" agent="fence_ilo" 
>>ipaddr="192.168.8.181" login="ilo" />
>></fencedevices>
>>
>></cluster>
>>
>>any hint what could cause the segfault of the fenced? The ilo boards on 
>>    
>>
>the 
>  
>
>>two servers are not yet configured, I don't know whether this could cause 
>>the problem?
>>
>>    
>>
>I changed the fencedevices in cluster.conf to fence_manual, but with the 
>same segmentation fault, so the not configured ilo boards doesnt seem to 
>cause the problem.
>
><fencedevices>
>        <fencedevice name="ilo_srv4" agent="fence_manual" nodename="srv4" />
>        <fencedevice name="ilo_srv5" agent="fence_manual" nodename="srv5" />
></fencedevices>
>
>
>  
>
>>kind regards
>>Sebastian
>>    
>>
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>  
>




From lhh at redhat.com  Mon Jul 16 17:15:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 16 Jul 2007 13:15:19 -0400
Subject: [Linux-cluster] GNBD/GFS question
In-Reply-To: <1184436855.29495.28.camel@com30.ser.aps.anl.gov>
References: <1184436855.29495.28.camel@com30.ser.aps.anl.gov>
Message-ID: <20070716171519.GD4423@redhat.com>

On Sat, Jul 14, 2007 at 01:14:15PM -0500, James Fait wrote:
> 
> This is the test that is currently causing the problem.
>         I kill the network link for cluster28(com28) using the network
>         switch to temporarily disable the link.
>         
>         Cluster28 is fenced successfully. One of the other nodes handles
>         the gfs cleanup.
>         
>         The network link is restored.
>         
>         Cluster28 attempts an orderly shutdown, and hangs on umount of
>         gfs.  Later, it is power cycled to force reboot.
>         
>         Cluster28 rejoins the cluster, and attempts to mount gfs.  The
>         gnbd_import command fails with the error messages:
>                 gnbd_recvd: ERROR login refused by the server,
>                 quitting : Operation not permitted
>                 gnbd_import: ERROR gnbd_recvd failed
>         
>         Cluster29 reports the error:
>                 com29 gnbd_serv[4794]: ERROR [gserv.c:468] client
>                 cluster28 is banned. Canceling login

cluster28 is still fenced.  You need to unfence (unban) it on the
server.

IIRC you can do it with gnbd_export -a (?) but I might be wrong.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 16 17:17:58 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 16 Jul 2007 13:17:58 -0400
Subject: [Linux-cluster] CS4/CS5 ///// RHEL4/RHEL5
In-Reply-To: <469B6C9C.20006@bull.net>
References: <469B6C9C.20006@bull.net>
Message-ID: <20070716171758.GE4423@redhat.com>

On Mon, Jul 16, 2007 at 03:03:24PM +0200, Alain Moulle wrote:
> Hi
> 
> I would like to know :
> 
> Is there a CS5 planned for RHEL5 ?

It's included as part of Advanced Platform now rather than a separate, 
layered product.

> If so :
> 1/Is it mandatory on CS5, or is it possible to compile
> a CS4 U5 to work on RHEL5 ?

You can't build RHCS4 on RHEL5 AFAIK.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From sdake at redhat.com  Mon Jul 16 18:41:52 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 16 Jul 2007 11:41:52 -0700
Subject: [Linux-cluster] AIS Executive couldn't open configuration
	object database component
In-Reply-To: <469B7369.90703@redhat.com>
References: <20070716132229.2E07247C77@l00-bugdead-prods.de>
	<469B7369.90703@redhat.com>
Message-ID: <1184611312.8845.22.camel@shih.broked.org>

lcrso files definitely belong in /usr/libexec/lcrso.

Show us which error you get when you move the files to that directory.

Regards
-steve


On Mon, 2007-07-16 at 14:32 +0100, Patrick Caulfield wrote:
> Sebastian Reitenbach wrote:
> > Hi,
> > 
> > I wanted to try out gfs2 instead of ocfs2 because of the acl's that it 
> > supports, but I do not get it running. I am on openSUSE 10.2, x86_64.
> > 
> > my kernel is 
> > Linux srv4 2.6.20.15-default #1 SMP Fri Jul 13 12:44:51 CEST 2007 x86_64 
> > x86_64 x86_64 GNU/Linux
> > 
> > I have openais rpm's created from source rpm's:
> > openais-devel-0.80.1-6
> > openais-0.80.1-6
> > 
> > and I use cluster-2.00.00, just compiled from source. My *.lcrso files are 
> > in /usr/lib64/lcrso/ and /usr/libexec/lcrso/. The LD_LIBRARY_PATH and the 
> > PATH variable have both pathes included.
> > 
> > when I start /etc/init.d/cman then I see the following:
> > /etc/init.d/cman start
> > Starting cluster:
> >    Loading modules... done
> >    Mounting configfs... done
> >    Starting ccsd... done
> >    Starting cman... failed
> > /usr/sbin/cman_tool: aisexec daemon didn't start
> > /etc/init.d/cman: line 413: failure: command not found
> > 
> > and this is /var/log/messages:
> > 
> > Jul 16 15:10:00 srv4 ccsd[23855]: Starting ccsd 2.00.00:
> > Jul 16 15:10:00 srv4 ccsd[23855]:  Built: Jul 13 2007 13:24:27
> > Jul 16 15:10:00 srv4 ccsd[23855]:  Copyright (C) Red Hat, Inc.  2004  All 
> > rights reserved.
> > Jul 16 15:10:00 srv4 ccsd[23855]: cluster.conf (cluster name = correo, 
> > version = 1) found.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive Service RELEASE 'subrev 
> > 1204 version 0.80.1'
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2002-2006 MontaVista 
> > Software, Inc and contributors.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive couldn't open 
> > configuration object database component.
> > Jul 16 15:10:01 srv4 aisexec: [MAIN ] AIS Executive exiting (-13).
> 
> 
> My guess is that the cman lcrso file is missing from /usr/libexec/lcrso
> 



From dbrieck at gmail.com  Mon Jul 16 20:02:14 2007
From: dbrieck at gmail.com (David Brieck Jr.)
Date: Mon, 16 Jul 2007 16:02:14 -0400
Subject: [Linux-cluster] LVS/Piranha Failover Problems
Message-ID: <8c1094290707161302k7b3cdc0atb1999e129e8d3fd9@mail.gmail.com>

We have been using Piranha for about 7 months now and we've
consistently had a problem failing over. All of our systems are 4.4,
but we're planning 4.5 and 5 in the near future. We have two load
balancers, one is supposed to fail over to the other. Behind the load
balancers we have 3 real servers. The load balancer has 12 ip
addresses that it uses to send traffic over to the real servers. Each
real server is configured the same way.

The problem we have is whenever we fail over the master LB to the
slave, one of the real servers refuses to use the fail over as it's
gateway causing outages for some people. It's configured exactly as it
should be, which is just like all the others but it still refuses to
function.

Another problem we have is that not all of the virtual server's public
IP addresses are loaded when it fails over and even on a restart of
pulse. Everything would indicate they are loaded by pulse, but they
don't show up in 'ip addr list'. If I manually run the ifconfig line
for each missing public IP everything works properly. I was originally
told this problem had to do with the channel bonding we were using,
however we have since stopped using it, but the problem persists.

Any help would be greatly appreciated.

David



From beres.laszlo at sys-admin.hu  Mon Jul 16 20:43:27 2007
From: beres.laszlo at sys-admin.hu (BERES Laszlo)
Date: Mon, 16 Jul 2007 22:43:27 +0200
Subject: [Linux-cluster] fence problem on rhel5 and another question
In-Reply-To: <20070716170619.GB4423@redhat.com>
References: <C537B996-78A6-443E-A7A8-10FE68F6687D@kinetikon.com>
	<20070716170619.GB4423@redhat.com>
Message-ID: <469BD86F.1040108@sys-admin.hu>

Lon Hohberger ?rta:

> In theory, GNBD + cluster mirroring should be able to do this, but I do
> not know if anyone has tried it yet (?).

Will RHEL 5.1 contain cmirror? We didn't find cmirror package in RHEL 5.

-- 
B?RES L?szl?	 RHCE, RHCX
senior IT engineer, trainer



From bfilipek at crscold.com  Mon Jul 16 21:28:42 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Mon, 16 Jul 2007 16:28:42 -0500
Subject: [Linux-cluster] suggestions on how to configure two services
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CE477@SRVEDI.upark.crscold.com>

Hello group, 

 

I have a two node cluster connected to a fibre channel SAN.  I need to
run SAMBA and SSH as clustered services. Both of these services will
need to access my SAN via a GFS mount. 

 

Should I manually mount the GFS file system individually on both nodes
by running "mount -t gfs /dev/sdb2 /mountpoint"? Or can I add the GFS
mount as a resource in Cluster Suite, and then assign that resource to
both services? If I assign it to both services, what happens when one
service needs to restart? Will the GFS mount be broken, which in turn
will cause the other service to fail? 

 

Thanks for any suggestions.

 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070716/18e2ccfa/attachment.htm>

From hal_bg at yahoo.com  Mon Jul 16 21:43:56 2007
From: hal_bg at yahoo.com (Hal)
Date: Mon, 16 Jul 2007 14:43:56 -0700 (PDT)
Subject: [Linux-cluster] CLVM, GFS and GNBD scenarios?
In-Reply-To: <2d51698c91e87270dd4cea9c46a3f17c@localhost>
Message-ID: <716240.91101.qm@web32205.mail.mud.yahoo.com>

What about iSCSI? 
if GNBD will not work?

--- andremachado <andremachado at techforce.com.br> wrote:

> Hello,
> there are gnbd flaws (at specific versions, verify) and clvm2 limitations.
> Read carefully [0].
> It seems that recent cvs commits were addressing these problems, but i am not
> sure, nor tested yet. Verify.
> CLVM2 currently does not have a working mirroring feature in cluster mode nor
> snapshoting.
> RH does not recomend gnbd internal fencing in failover [1].
> Regards.
> Andre Felipe Machado
> 
> [0]
>
http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch
> [1]
>
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Global_Network_Block_Device/s1-gnbd-mp-sn.html
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



 
____________________________________________________________________________________
The fish are biting. 
Get more visitors on your site using Yahoo! Search Marketing.
http://searchmarketing.yahoo.com/arp/sponsoredsearch_v2.php



From pheeh at nodeps.org  Mon Jul 16 22:50:18 2007
From: pheeh at nodeps.org (pheeh at nodeps.org)
Date: Mon, 16 Jul 2007 15:50:18 -0700 (MST)
Subject: [Linux-cluster] Mirror Volume Pairs
Message-ID: <43776.12.152.67.72.1184626218.squirrel@12.152.67.72>

I am purchasing the equipment to attempt a configuration I found here:

http://www.redhat.com/magazine/008jun05/features/gfs/#fig-mirroring

How does GNBD able to export the mirrored drives?  How would the mirror be
created?  If gnbd exports a local device, lets call it /dev/md0, how does
the local drive actually get mirrored from the drive of the second
machine?

I haven't been able to locate too much information on this mirror volume
pairs, is there a document anywhere that could explain the setup?

The goal is to get two machines, each with two 500 Gig drives to use gnbd
multipath to export their local drives as a single source...thus there
would be failover just as described in the document.



From joseparrella at gmail.com  Tue Jul 17 00:31:49 2007
From: joseparrella at gmail.com (=?windows-1252?Q?Jos=E9_Miguel_Parrella_Romero?=)
Date: Mon, 16 Jul 2007 20:31:49 -0400
Subject: [Linux-cluster] suggestions on how to configure two services
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CE477@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D026CE477@SRVEDI.upark.crscold.com>
Message-ID: <469C0DF5.7030705@gmail.com>

Brad Filipek escribi?:
> Should I manually mount the GFS file system individually on both nodes
> by running ?mount ?t gfs /dev/sdb2 /mountpoint?? Or can I add the GFS

Hello, Brad:

I'm currently running a two-node mail cluster using GFS, and while I'm
able to do mount -t, I keep the filesystem entries in /etc/fstab so I
keep standard options such as noatime,nodev,nosuid,noexec as well as the
 mount on boot option. If you add _netdev to your fstab entries, you'll
make sure that your partitions are mounted on boot, which is generally a
Good Thing (TM).

Make sure to configure your cluster in the two-node mode, which is done
by adding the twonode="1" and expectedvotes="1" arguments to your cman
declaration in your cluster.conf. This will make the cluster quorate
even with one node up.

If I understand correctly, when one of your nodes reboot, if you're
using a sane initscripts configuration, your partitions should shutdown
as well as other daemons such as fenced, clvmd, cmand and ccsd. This
will tell your other node that its partner is going down and it will
keep working just as usual. I haven't tried this in my actual
configuration, but I can start and stop cman and the other node keeps
working OK.

However, if one of the nodes goes down suddenly, the other node will
stall its writing to your clustered filesystem until you acknowledge the
fencing (that is, assuring that the shutdown node won't write to the
shared storage) either by using the manual method (fence_ack_manual) or
by interacting with hardware such as a UPS. The cluster documentation
should provide enough information on available methods and configuration.

HTH,
Jose



From jprats at cesca.es  Tue Jul 17 06:30:52 2007
From: jprats at cesca.es (Jordi Prats)
Date: Tue, 17 Jul 2007 08:30:52 +0200
Subject: [Linux-cluster] rgmanager
Message-ID: <469C621C.6060902@cesca.es>

Hi,
I've been searching for some documentation about the scripts used by 
resource manager, but I found nothing. I've been using the start,stop 
and status options on my scripts so if I want to move a service it stops 
it and restarts it on the other node. There's any standard way to do 
relocate (move the service)?

I've seen that the script vm.sh on cluster-2.0 sources have a "migrate" 
this will do the job? (On the script is not implemented) There's any 
more options?

Sorry for that simple question, but I found no documentation about this 
on http://sourceware.org/cluster/

Thanks,
Jordi



From sebastia at l00-bugdead-prods.de  Tue Jul 17 06:33:25 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 17 Jul 2007 08:33:25 +0200
Subject: [Linux-cluster] fenced segfault
Message-ID: <20070717063325.77AF547E35@l00-bugdead-prods.de>

Hi James,

James Parsons <jparsons at redhat.com> wrote: 
> Sebastian Reitenbach wrote:
> There is no attribute in the conf file named 'ipaddr' for fence_ilo. 
> Change it to 'hostname'...you can still use a dotted quad as the 
> arg...it's just that the name is not correct.
> 
thanks for your suggestion, but the fenced still segfaults when I start 
cman. I found the man page of fence_ilo, I assume the section about the 
parameters listed in the STDIN PARAMETERS I can add in the cluster.conf file 
as the parameter to the fence_agent? 

I also removed the <fencedevices> section in the cluster.conf file 
completely, but the fenced still segfaults, so in my eyes, my configuration 
file doesn't seem to be the problem.

I recognized the short time delay between these two log statements in the 
messages file:
Jul 17 08:01:20 srv4 fenced[32601]: srv5 not a cluster member after 6 sec 
post_join_delay
Jul 17 08:01:21 srv4 kernel: fenced[32601]: segfault at 0000000000000000 rip 
0000000000405e97 rsp 00007fff818454e0 error 4

I only started the srv4, but not srv5, so the first message was right, it 
detected the not running srv5 node. Therefore I tried to start both at the 
same time, and the fenced did not crash anymore. But now the init script of 
the cman hangs forever at "Starting fencing..." 

I have these processes running, since 20 minutes:
srv4:/usr/src/cluster-2.00.00 # ps ax | grep fence
32688 ?        Ss     0:00 /sbin/fenced
32700 pts/1    S+     0:00 /sbin/fence_tool -w -t 300 join
32733 pts/3    S+     0:00 grep fence

when I make a tcpdump on node srv4, I see a lot of UDP packets going out to 
srv5:5404 and to 239.192.25.250.5405.
when I make the same on srv5, I see a lot of UDP packets going to srv4.
But I do not see packets from srv5 arriving at srv4, and vice versa, despite 
both hosts can ping each other and the firewall is deactivated.

Anybody has an idea what still is wrong with my setup?

kind regards
Sebastian



From sebastia at l00-bugdead-prods.de  Tue Jul 17 06:40:30 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 17 Jul 2007 08:40:30 +0200
Subject: [Linux-cluster] fencing via ilo over ssh
Message-ID: <20070717064031.2B1C747E2D@l00-bugdead-prods.de>

Hi,

as far as I read the manual page of fence_ilo, it seems it only uses the 
https access to the ilo boards. Is it somehow possible to fence a host via 
ssh to the ilo board? ssh would allow me to use (passwordless) ssh keys.

kind regards
Sebastian



From Arne.Brieseneck at vodafone.com  Tue Jul 17 07:33:26 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Tue, 17 Jul 2007 09:33:26 +0200
Subject: [Linux-cluster] fencing via ILOM board for SUN servers
In-Reply-To: <20070717064031.2B1C747E2D@l00-bugdead-prods.de>
Message-ID: <E67F1468BF7A4C418D874810215A377EAFA378@EITO-MBX01.internal.vodafone.com>


Hi,

Has anyone already some experiences if a fencing would be possible over
fence_ilo on SUN X4100 servers? The documentation mention only HP
ones...


Kind Regards
Arne



From maciej.bogucki at artegence.com  Tue Jul 17 08:47:50 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 17 Jul 2007 10:47:50 +0200
Subject: [Linux-cluster] Are RHEL5 cluster comms different than RHEL4?
In-Reply-To: <469B9983.10203@redhat.com>
References: <469B9983.10203@redhat.com>
Message-ID: <469C8236.5040607@artegence.com>

Leo Pleiman napisa?(a):
> Hello,
> 
> $subject?
> Using an IBM bladecenter, we are able to build a RHEL4 cluster of which
> the nodes are in different chassis and different racks using a private
> vlan. When building a RHEL5 cluster, nodes which are NOT in the same
> chassis cannot join a cluster.
> 
> Example:
> 
> node1 is in chassis 1
> node2 is in chassis 3
> node3 is in chassis 3
> 
> All nodes are using bond0:0 as their public interface with FQDN in
> /etc/hosts.
> All nodes are using bond0.999 as the private cluster interface with FQDN
> in hosts and cluster.conf.
> 
> Under RHEL4 all nodes join to form a cluster.
> Under RHEL5 nodes 2 and 3 form a cluster but node1 never joins the
> cluster. Under RHEL5, when node1 tries to join the cluster there are NO
> messages in /var/log/messages that it is even trying to join the cluster.
> 
> Any ideas?

Hello,

First of all check if all nodes are in the same subnet, because cman
send data to brodcast address and udp/6809. I suppose that it is the
problem in Your case. You could use tcpdump to check traffic on
interfece bond0:0 and bond0.999.

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Tue Jul 17 09:41:24 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 17 Jul 2007 11:41:24 +0200
Subject: [Linux-cluster] Mirror Volume Pairs
In-Reply-To: <43776.12.152.67.72.1184626218.squirrel@12.152.67.72>
References: <43776.12.152.67.72.1184626218.squirrel@12.152.67.72>
Message-ID: <469C8EC4.50601@artegence.com>

> I am purchasing the equipment to attempt a configuration I found here:
> 
> http://www.redhat.com/magazine/008jun05/features/gfs/#fig-mirroring
> 
> How does GNBD able to export the mirrored drives?  How would the mirror be
> created?  If gnbd exports a local device, lets call it /dev/md0, how does
> the local drive actually get mirrored from the drive of the second
> machine?
> 
> I haven't been able to locate too much information on this mirror volume
> pairs, is there a document anywhere that could explain the setup?
> 
> The goal is to get two machines, each with two 500 Gig drives to use gnbd
> multipath to export their local drives as a single source...thus there
> would be failover just as described in the document.

Hello,

Use drbd [1] to mirror two block device via network. Then use GNBD or
iSCSI to export DRBD device to group of servers. If You need automatic
failover You could use heartbeat [2] or RHCS [3]

[1] - http://www.drbd.org/
[2] - http://www.linux-ha.org/
[3] - http://www.redhat.com/software/rha/cluster/

Best Regards
Maciej Bogucki



From huangxiong at uit.com.cn  Tue Jul 17 10:03:39 2007
From: huangxiong at uit.com.cn (Huang Xiong)
Date: Tue, 17 Jul 2007 18:03:39 +0800
Subject: [Linux-cluster] Questions about active-active Samba
Message-ID: <200707171803.39966.huangxiong@uit.com.cn>

Hello,

This thread is long, please pay some patience.

I am building active-active Samba across two nodes, 

nodes(both installed RHEL4.5):
--------------------------
kaka1: 192.168.3.52
kaka2: 192.168.3.249

and here's the "/etc/cluster/cluster.conf":
---------------------------
<cluster alias="seedorf" config_version="159" name="seedorf">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="kaka1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NPS" nodename="kaka1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="kaka2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NPS" nodename="kaka2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="NPS"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="failover-1" ordered="1">
                                <failoverdomainnode name="kaka1" 
priority="1"/>
                                <failoverdomainnode name="kaka2" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="failover-2" ordered="1">
                                <failoverdomainnode name="kaka1" 
priority="2"/>
                                <failoverdomainnode name="kaka2" 
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/milan/mirror" 
force_unmount="0" fsid="37802" fstype="gfs" mountpoint="/nfsdata" 
name="phillip_gfs" options="acl"/>
                        <smb name="samba_1" workgroup="samba_test"/>
                        <smb name="samba_2" workgroup="samba_test"/>
                        <script file="/etc/init.d/smb" name="smb_script"/>
                        <ip address="192.168.3.143" monitor_link="1"/>
                        <ip address="192.168.3.150" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="failover-1" name="smb-1" 
recovery="relocate">
                        <smb ref="samba_1">
                                <clusterfs ref="phillip_gfs"/>
                                <script ref="smb_script"/>
                        </smb>
                        <ip ref="192.168.3.143"/>
                </service>
                <service autostart="1" domain="failover-2" name="smb-2" 
recovery="relocate">
                       <smb ref="samba_2">
                                <clusterfs ref="phillip_gfs"/>
                                <script ref="smb_script"/>
                        </smb>
                        <ip ref="192.168.3.150"/>
                </service>
        </rm>
</cluster>
-------------------------------------------

When these two nodes are both running, there will automatically 
create /etc/samba/smb.conf.samba_1 in kaka1, and /etc/samba/smb.conf.samba_2 
in kaka2:

On kaka1:
--------------------------
[root at kaka1 samba]# cat smb.conf.samba_1 | grep -v "#"
[global]
        workgroup = samba_test
        pid directory = /var/run/samba/samba_1
        lock directory = /var/cache/samba/samba_1
        log file = /var/log/samba/%m.log
        encrypt passwords = yes
        bind interfaces only = yes
        netbios name = samba_1
        interfaces = 192.168.3.143
[test]
        public = yes
        path = /nfsdata
        read only = no
[root at kaka1 samba]# scp smb.conf.samba_1 kaka2:/etc/samba/

On kaka2:
---------------------------
[root at kaka2 samba]# cat smb.conf.samba_2 |grep -v "#"
[global]
        workgroup = samba_test
        pid directory = /var/run/samba/samba_2
        lock directory = /var/cache/samba/samba_2
        log file = /var/log/samba/%m.log
        encrypt passwords = yes
        bind interfaces only = yes
        netbios name = samba_2
        interfaces = 192.168.3.150
[test2]
        public = yes
        path = /nfsdata
        read only = no
[root at kaka2 samba]# scp smb.conf.samba_2 kaka1:/etc/samba/


Now, reboot the nodes and check the cluster status:
---------------------------------
[root at kaka2 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  kaka1                                    Online, rgmanager
  kaka2                                    Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  smb-1                kaka1                          started
  smb-2                kaka2                          started


and I can see the float IP(s) has been assigned:
----------------------------------
On kaka1:

[root at kaka1 ~]# ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:e8:11:a1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.52/24 brd 192.168.3.255 scope global eth0
    inet 192.168.3.143/32 scope global eth0
    inet6 fe80::20c:29ff:fee8:11a1/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

On kaka2:

[root at kaka2 ~]# ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:24:0c:72 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.249/24 brd 192.168.3.255 scope global eth0
    inet 192.168.3.150/32 scope global eth0
    inet6 fe80::20c:29ff:fe24:c72/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0


At this point, poweroff the "kaka1", and kaka1's original float 
IP(192.168.3.143) would be appended to kaka2:
-------------------------------------------------
[root at kaka2 ~]# ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:24:0c:72 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.249/24 brd 192.168.3.255 scope global eth0
    inet 192.168.3.150/32 scope global eth0
    inet 192.168.3.143/32 scope global eth0
    inet6 fe80::20c:29ff:fe24:c72/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
 
Hmm, it seems the samba services still keep running well, and the clients 
accessing "192.168.3.143" do not feel interrupt. 
---------------------------------------
[root at kaka2 ~]# clustat
Member Status: Quorate
  Member Name                              Status
  ------ ----                              ------
  kaka1                                    Offline
  kaka2                                    Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  smb-1                kaka2                          started
  smb-2                kaka2                          started



However, when I power on kaka1, the trouble happens, not only "192.168.3.143" 
would be removed, but also kaka2 lost its original float IP "192.168.3.150". 
There're below errors in "/var/log/messages" on kaka2:
---------------------------------------------
[root at kaka2 ~] # tail -f /var/log/messages

Jul 17 17:49:24 kaka2 kernel: CMAN: node kaka1 rejoining
Jul 17 17:49:33 kaka2 clurgmgrd[3393]: <info> Magma Event: Membership Change
Jul 17 17:49:33 kaka2 clurgmgrd[3393]: <info> State change: kaka1 UP
Jul 17 17:49:35 kaka2 clurgmgrd[3393]: <notice> Stopping service smb-1
Jul 17 17:49:36 kaka2 clurgmgrd: [3393]: <info> Removing IPv4 address 
192.168.3.143 from eth0
Jul 17 17:49:44 kaka2 clurgmgrd: [3393]: <info> Executing /etc/init.d/smb 
status
Jul 17 17:49:46 kaka2 clurgmgrd: [3393]: <info> Executing /etc/init.d/smb stop
Jul 17 17:49:46 kaka2 smb: smbd shutdown succeeded
Jul 17 17:49:46 kaka2 nmbd[4571]: [2007/07/17 17:49:46, 0] 
nmbd/nmbd.c:terminate(56)
Jul 17 17:49:46 kaka2 nmbd[4571]:   Got SIGTERM: going down...
Jul 17 17:49:46 kaka2 nmbd[4571]: [2007/07/17 17:49:46, 0] 
libsmb/nmblib.c:send_udp(790)
Jul 17 17:49:46 kaka2 nmbd[4571]:   Packet send failed to 192.168.3.255(138) 
ERRNO=Invalid argument
Jul 17 17:49:46 kaka2 smb: nmbd shutdown succeeded
Jul 17 17:49:47 kaka2 clurgmgrd: [3393]: <info> Stopping Samba 
instance "samba_1"
Jul 17 17:49:47 kaka2 nmbd[6736]: [2007/07/17 17:49:47, 0] 
nmbd/nmbd.c:terminate(56)
Jul 17 17:49:47 kaka2 nmbd[6736]:   Got SIGTERM: going down...
Jul 17 17:49:47 kaka2 nmbd[6736]: [2007/07/17 17:49:47, 0] 
libsmb/nmblib.c:send_udp(790)
Jul 17 17:49:47 kaka2 nmbd[6736]:   Packet send failed to 192.168.3.255(138) 
ERRNO=Invalid argument
Jul 17 17:49:47 kaka2 clurgmgrd[3393]: <notice> Service smb-1 is stopped
Jul 17 17:50:14 kaka2 clurgmgrd: [3393]: <err> share_start_stop: nmbd for 
service  died!
Jul 17 17:50:14 kaka2 clurgmgrd[3393]: <notice> status on smb:samba_2 returned 
255 (unspecified)
Jul 17 17:50:14 kaka2 clurgmgrd[3393]: <notice> Stopping service smb-2
Jul 17 17:50:14 kaka2 clurgmgrd: [3393]: <info> Removing IPv4 address 
192.168.3.150 from eth0
Jul 17 17:50:15 kaka2 nmbd[4488]: [2007/07/17 17:50:15, 0] 
lib/interface.c:load_interfaces(220)
Jul 17 17:50:15 kaka2 nmbd[4488]:   WARNING: no network interfaces found
Jul 17 17:50:15 kaka2 nmbd[4488]: [2007/07/17 17:50:15, 0] 
nmbd/nmbd.c:reload_interfaces(265)
Jul 17 17:50:15 kaka2 nmbd[4488]:   reload_interfaces: No subnets to listen 
to. Shutting down...
Jul 17 17:50:24 kaka2 clurgmgrd: [3393]: <info> Executing /etc/init.d/smb stop
Jul 17 17:50:24 kaka2 smb: smbd shutdown failed
Jul 17 17:50:24 kaka2 smb: nmbd shutdown failed
Jul 17 17:50:24 kaka2 clurgmgrd: [3393]: <err> script:smb_script: stop 
of /etc/init.d/smb failed (returned 1)
Jul 17 17:50:24 kaka2 clurgmgrd[3393]: <notice> stop on script:smb_script 
returned 1 (generic error)
Jul 17 17:50:24 kaka2 clurgmgrd[3393]: <crit> #12: RG smb-2 failed to stop; 
intervention required
Jul 17 17:50:24 kaka2 clurgmgrd[3393]: <notice> Service smb-2 is failed


[root at kaka2 ~]# ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:24:0c:72 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.249/24 brd 192.168.3.255 scope global eth0
    inet6 fe80::20c:29ff:fe24:c72/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

[root at kaka2 ~]# clustat
Member Status: Quorate
  Member Name                              Status
  ------ ----                              ------
  kaka1                                    Online, rgmanager
  kaka2                                    Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  smb-1                kaka1                          started
  smb-2                (kaka2)                        failed



According to active-active samba cluster, every samba service could ensure 
running and must be able to failover to others when it fails.  While on my 
case, when kaka1 power on again, the samba service "smb-2" on Kaka2 failed 
and the float IP has also been removed.

Would you please help me fix this issue? Any suggestion would be appreciated. 


Regards,
Phillip

 



From breeves at redhat.com  Tue Jul 17 10:22:04 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Tue, 17 Jul 2007 11:22:04 +0100
Subject: [Linux-cluster] RHEL5 compatibility with EMC AX150 SAN?
In-Reply-To: <20070716170740.GC4423@redhat.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D026CE17B@SRVEDI.upark.crscold.com>
	<20070716170740.GC4423@redhat.com>
Message-ID: <469C984C.8090608@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lon Hohberger wrote:
> On Fri, Jul 13, 2007 at 12:21:30PM -0500, Brad Filipek wrote:
>> I am trying to cluster two RHEL5 nodes to an EMC AX-150 SAN device. Both
>> of my servers (Dell PowerEdge 1950's) have a QLogic QLE2462 HBA. Does
>> anyone have a RHEL5 cluster running with a similar fiber channel setup?
>> I am looking for the proper drivers for these HBA's, but Dell, EMC, and
>> QLogic are telling me that this configuration is not yet supported so
>> there are no drivers yet. I would also like to install SANSurfer if
>> possible. 
> 
> I have an AX100 and it works fine; I can't speak from experience about
> the AX150.
> 

AX150's working fine for me here. I do all my multipath testing on it.

Regards,
Bryn.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGnJhM6YSQoMYUY94RAhvsAKDNvXi9TESg7IfiPINCsgF0dXI5mwCgotp1
nSoOMma0atEhB3Y6+YN8cwM=
=udqf
-----END PGP SIGNATURE-----



From sebastia at l00-bugdead-prods.de  Tue Jul 17 10:56:47 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 17 Jul 2007 12:56:47 +0200
Subject: [Linux-cluster] fencing via ilo over ssh
Message-ID: <20070717105647.B900F47E51@l00-bugdead-prods.de>

someone privately wrote: 
> typically you can telnet to the ilo board's ip address.  it should give
> you a nice text based interface to control the power of the node.
> 
I have written a stonith script for linux-ha to use ssh , I assume the 
fence_ scripts are working in a similar way. If sth. like this does not 
exists, then I think I could create one. Is it somewhere documented how 
these are working?

kind regards
Sebastian



From Arne.Brieseneck at vodafone.com  Tue Jul 17 10:58:59 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Tue, 17 Jul 2007 12:58:59 +0200
Subject: [Linux-cluster] fencing via ilo over ssh
In-Reply-To: <20070717105647.B900F47E51@l00-bugdead-prods.de>
Message-ID: <E67F1468BF7A4C418D874810215A377EAFA3EF@EITO-MBX01.internal.vodafone.com>

Do you have this script somewhere available as an example? 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sebastian
Reitenbach
Sent: Dienstag, 17. Juli 2007 12:57
To: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] fencing via ilo over ssh

someone privately wrote: 
> typically you can telnet to the ilo board's ip address.  it should 
> give you a nice text based interface to control the power of the node.
> 
I have written a stonith script for linux-ha to use ssh , I assume the
fence_ scripts are working in a similar way. If sth. like this does not
exists, then I think I could create one. Is it somewhere documented how
these are working?

kind regards
Sebastian

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From sebastia at l00-bugdead-prods.de  Tue Jul 17 14:08:49 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 17 Jul 2007 16:08:49 +0200
Subject: [Linux-cluster] fencing via ilo over ssh
Message-ID: <20070717140849.8A55A47EDE@l00-bugdead-prods.de>

"Brieseneck, Arne, VF-Group" <Arne.Brieseneck at vodafone.com> wrote: 
> Do you have this script somewhere available as an example? 
> 

It's appended.

kind regards
Sebastian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ilo
Type: application/octet-stream
Size: 5763 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070717/86cf1649/attachment.obj>

From sebastia at l00-bugdead-prods.de  Tue Jul 17 14:36:25 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 17 Jul 2007 16:36:25 +0200
Subject: [Linux-cluster] mount: not in default fence domain
Message-ID: <20070717143625.BC44A47EDC@l00-bugdead-prods.de>

Hi,

I am still on my openSUSE 10.2, x86_64, cluster-2.0.0. and openais 0.80.1 
and kernel version 2.6.20.15.

I got so far, that my two node cluster tries to start when i issue 
a /etc/init.d/cman start. When I only start one node, the fenced crashes. 
But when I start both at the same time, the fenced stays alive, as mentioned 
in an earlier mail in another thread. After about half an hour, I thought I 
try mounting the GFS2 partition. The mount process hung on both machines 
too. Then I killed the cman init script on one machine. The other detected 
this, but then the mount commands ended too, with the following error 
message:

Jul 17 15:28:06 srv5 gfs_controld[22012]: mount: not in default fence domain
Jul 17 15:28:35 srv5 gfs_controld[22012]: mount: not in default fence domain

the fenced was running, but no fence_ilo, but I am also not sure, whether 
this one has to run or not.

what does the error message above is trying to tell me? Can I fix this, or 
will it "fix itself", when I am getting the cman script starting correctly?

kind regards
Sebastian

below my cluster.conf

<?xml version="1.0"?>
<cluster name="correo" config_version="1">
  <cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="srv4" nodeid="1" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv4"/>
                </method>
        </fence>
</clusternode>

<clusternode name="srv5" nodeid="1" votes="1">
        <fence>
                <method name="single">
                        <device name="ilo_srv5"/>
                </method>
        </fence>
</clusternode>

</clusternodes>
<fencedevices>
        <fencedevice name="ilo_srv4" agent="fence_ilo" hostname="srv4" 
login="ilo" password="test" />
        <fencedevice name="ilo_srv5" agent="fence_ilo" hostname="srv5" 
login="ilo" password="test" />
</fencedevices>

</cluster>




From lhh at redhat.com  Tue Jul 17 21:17:58 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 17 Jul 2007 17:17:58 -0400
Subject: [Linux-cluster] rgmanager
In-Reply-To: <469C621C.6060902@cesca.es>
References: <469C621C.6060902@cesca.es>
Message-ID: <20070717211758.GB9417@redhat.com>

On Tue, Jul 17, 2007 at 08:30:52AM +0200, Jordi Prats wrote:
> Hi,
> I've been searching for some documentation about the scripts used by 
> resource manager, but I found nothing. I've been using the start,stop 
> and status options on my scripts so if I want to move a service it stops 
> it and restarts it on the other node. There's any standard way to do 
> relocate (move the service)?

Normally, you want to stop the service before starting it on another
node; stop on one node followed by start on another node is called a
"relocation" in the context of linux-cluster or Red Hat's products.

> I've seen that the script vm.sh on cluster-2.0 sources have a "migrate" 
> this will do the job? (On the script is not implemented) There's any 
> more options?

If, however, you have a service which is capable of safely relocating
(like a virtual machine), you can implement the "migrate" option.

Note that in order to implement migration, you will need a special kind
of script called a resource agent, which is basically a script with
some XML that will tell the cluster software what it can do and what
its parameters are.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From hal_bg at yahoo.com  Tue Jul 17 22:48:38 2007
From: hal_bg at yahoo.com (Hal)
Date: Tue, 17 Jul 2007 15:48:38 -0700 (PDT)
Subject: [Linux-cluster] mount -t gfs error
In-Reply-To: <20070717211758.GB9417@redhat.com>
Message-ID: <949534.71832.qm@web32205.mail.mud.yahoo.com>

Hi all

I am trying to reproduce the setup described in Minimum GFS HowTo.

I had some problems which I managed to solve and successfully exported,
imported devices, made gfs on the gnbd device but:

# ccsd
# cman_tool join
# fence_tool join
fence_tool: can't communicate with fenced -1
# fenced -D
1184700474 cman_init error 0 111

I am using fence_gnbd.

I decided to try to mount gfs anyway:
# mount -t gfs /dev/gnbd/global_dev /mnt
mount: Cannot allocate memory

before that I tried ext3 fs on the same device and mount worked.

What might be wrong?

regards,
Hal


      ____________________________________________________________________________________
Fussy? Opinionated? Impossible to please? Perfect.  Join Yahoo!'s user panel and lay it on us. http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7 



From chiranthlk at yahoo.com  Wed Jul 18 01:44:03 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Tue, 17 Jul 2007 18:44:03 -0700 (PDT)
Subject: [Linux-cluster] RGMANAGER segmentation fault
Message-ID: <46155.48238.qm@web52309.mail.re2.yahoo.com>

Hi all,

We have received segmentation fault at rgmanager several times while restarting a service. (Not frequent). After that automatic restart of the server happened.
My OS is RHEL 4 Update 4 (2.6.9-42.ELsmp)
rgmanager-1.9.54-1
All the cluster packages are RHEL Update4.
I saw this bug has been fixed in RHEL Update3. 

Log is as followed.

Jul 16 11:09:50 UI1 clurgmgrd[3631]: <notice> Stopping service reportgenerator Jul 16 11:09:51 UI1 clurgmgrd: [3631]: <info> Executing /etc/init.d/reportgenerator stop Jul 16 11:09:52 UI1 snmptrapd[3478]: 2007-07-16 11:09:52 192.168.40.1
[192.168.40.1]: SNMPv2-MIB::sysUpTime.0 = Timeticks: (283880832) 32 days,
20:33:28.32     SNMPv2-MIB::snmpTrapOID.0 = OID: UCD-SNMP-MIB::linux   
SNMPv2-SMI::private.2021.2.1.101 = STRING: "ReportGenerator stopped on UI1"
Jul 16 11:09:53 UI1 clurgmgrd[3631]: <notice> Service reportgenerator is stopped Jul 16 11:09:53 UI1 kernel: clurgmgrd[28708]: segfault at 0000000000000050 rip 000000000040399e rsp 0000000043245010 error 4 Jul 16 11:10:01 UI1 crond(pam_unix)[32385]: session opened for user root by (uid=0) Jul 16 11:10:01 UI1 crond(pam_unix)[32387]: session opened for user ccs by
(uid=0)
Jul 16 11:10:01 UI1 crond(pam_unix)[32389]: session opened for user ccs by
(uid=0)
Jul 16 11:10:01 UI1 crond(pam_unix)[32391]: session opened for user ccs by
(uid=0)
Jul 16 11:10:01 UI1 crond(pam_unix)[32389]: session closed for user ccs Jul 16 11:10:01 UI1 crond(pam_unix)[32385]: session closed for user root Jul 16 11:10:01 UI1 crond(pam_unix)[32387]: session closed for user ccs Jul 16 11:10:02 UI1 crond(pam_unix)[32391]: session closed for user ccs Jul 16 11:11:01 UI1 clurgmgrd[3630]: <crit> Watchdog: Daemon died, rebooting...
Jul 16 11:11:01 UI1 kernel: md: stopping all md devices.
Jul 16 11:11:01 UI1 kernel: md: md0 switched to read-only mode.
Jul 16 11:14:52 UI1 syslogd 1.4.1: restart.

Can anyone advise on this matter?

thanks and regards

chirantha




 
____________________________________________________________________________________
It's here! Your new message!  
Get new email alerts with the free Yahoo! Toolbar.
http://tools.search.yahoo.com/toolbar/features/mail/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070717/b5a1e2f7/attachment.htm>

From sebastia at l00-bugdead-prods.de  Wed Jul 18 06:10:45 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Wed, 18 Jul 2007 08:10:45 +0200
Subject: [Linux-cluster] mount -t gfs error
Message-ID: <20070718061045.885EF48023@l00-bugdead-prods.de>

Hi,

linux clustering <linux-cluster at redhat.com> wrote: 
> Hi all
> 
> I am trying to reproduce the setup described in Minimum GFS HowTo.
> 
> I had some problems which I managed to solve and successfully exported,
> imported devices, made gfs on the gnbd device but:
> 
> # ccsd
> # cman_tool join
> # fence_tool join
> fence_tool: can't communicate with fenced -1
> # fenced -D
> 1184700474 cman_init error 0 111
> 
do you have groupd running too?

kind regards
Sebastian



From hal_bg at yahoo.com  Wed Jul 18 07:42:48 2007
From: hal_bg at yahoo.com (Hal)
Date: Wed, 18 Jul 2007 00:42:48 -0700 (PDT)
Subject: [Linux-cluster] mount -t gfs error
In-Reply-To: <20070718061045.885EF48023@l00-bugdead-prods.de>
Message-ID: <235962.76380.qm@web32207.mail.mud.yahoo.com>


--- Sebastian Reitenbach <sebastia at l00-bugdead-prods.de> wrote:

> Hi,
> 
> linux clustering <linux-cluster at redhat.com> wrote: 
> > Hi all
> > 
> > I am trying to reproduce the setup described in Minimum GFS HowTo.
> > 
> > I had some problems which I managed to solve and successfully exported,
> > imported devices, made gfs on the gnbd device but:
> > 
> > # ccsd
> > # cman_tool join
> > # fence_tool join
> > fence_tool: can't communicate with fenced -1
> > # fenced -D
> > 1184700474 cman_init error 0 111
> > 
> do you have groupd running too?

No I had not :(
I have just tryed but the attempt was unsuccesful
# groupd
# ps -ef |grep groupd
root      1361 32302  0 07:24 pts/2    00:00:00 grep groupd
# groupd -D
1184732218 cman_init error 0 111


hal
produces the same error as fenced.


      ____________________________________________________________________________________
Shape Yahoo! in your own image.  Join our Network Research Panel today!   http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7 




From hal_bg at yahoo.com  Wed Jul 18 11:58:06 2007
From: hal_bg at yahoo.com (Hal)
Date: Wed, 18 Jul 2007 04:58:06 -0700 (PDT)
Subject: [Linux-cluster] GNBD and gfs - wrong FS type
In-Reply-To: <235962.76380.qm@web32207.mail.mud.yahoo.com>
Message-ID: <745589.7708.qm@web32201.mail.mud.yahoo.com>

hallo 
I have trouble mounting GNBD inported gfs on both nodes of my test clusuer. If
the lock is set to "lock_nolock" it mounts fine but this is not what i want.
When I use lock_dlm I get: 
mount: wrong fs type, bad option, bad superblock on /dev/gnbd/global_disk,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

What I am doing wrong?
Total output follows (Selinux is NOT in enforcing mode):

[root at node2 ~]# modprobe gnbd
[root at node2 ~]# modprobe gfs2
[root at node2 ~]# modprobe gfs
[root at node2 ~]# modprobe lock_dlm
[root at node2 ~]# gnbd_import -n -i 192.168.0.60
gnbd_import: created directory /dev/gnbd
gnbd_import: created gnbd device global_disk
gnbd_recvd: gnbd_recvd started
[root at node2 ~]# cd /etc/init.d/
[root at node2 init.d]# ./cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root at node2 ~]# gfs_mkfs -p lock_dlm -t testc:gfs1 -j6 /dev/gnbd/global_disk 
This will destroy any data on /dev/gnbd/global_disk.
  It appears to contain a gfs filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/gnbd/global_disk
Blocksize:                 4096
Filesystem Size:           851880
Journals:                  6
Resource Groups:           14
Locking Protocol:          lock_dlm
Lock Table:                testc:gfs1

Syncing...
All Done
[root at node2 ~]# mount -t gfs /dev/gnbd/global_disk /mnt
mount: wrong fs type, bad option, bad superblock on /dev/gnbd/global_disk,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

[root at node2 ~]# dmesg |tail
GFS: fsid=testc:gfs1.0: Scanning for log elements...
GFS: fsid=testc:gfs1.0: Found 0 unlinked inodes
GFS: fsid=testc:gfs1.0: Found quota changes for 0 IDs
GFS: fsid=testc:gfs1.0: Done
SELinux: initialized (dev gnbd0, type gfs), uses xattr
audit(1184744195.259:4): avc:  denied  { getattr } for  pid=1848 comm="hald"
name="global_disk" dev=tmpfs ino=19253 scontext=system_u:system_r:hald_t:s0
tcontext=root:object_r:device_t:s0 tclass=blk_file
Trying to join cluster "lock_dlm", "testc:gfs1"
Joined cluster. Now mounting FS...
GFS: fsid=testc:gfs1.4294967295: can't mount journal #4294967295
GFS: fsid=testc:gfs1.4294967295: there are only 6 journals (0 - 5)
[root at node2 ~]# 


 
____________________________________________________________________________________
Now that's room service!  Choose from over 150,000 hotels
in 45,000 destinations on Yahoo! Travel to find your fit.
http://farechase.yahoo.com/promo-generic-14795097



From henker at evendi.de  Wed Jul 18 12:58:14 2007
From: henker at evendi.de (Henke)
Date: Wed, 18 Jul 2007 14:58:14 +0200
Subject: [Linux-cluster] GFS inodes
Message-ID: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>


Hi,

I have a problem using GFS 6.1 in a 4-node cluster.
My scenario is as follows:
4 nodes share a 100GB SAN device.
node 1 generates data, while nodes 2-4 only read that data (although the 
gfs is mounted rw).
The amount of shared data is ~ 3GB.
node 1 creates new data and moves the old data. After the mv, the old 
directory is removed.
nodes 2-4 notice that the data has changed and restart.

However, I still wonder why the deleted inodes are never de-allocated.
Without manual intervention (eg. "service gfs restart") on one node 1,
the filesystem grows about 3GB/day although the actual data is still only
~3GB.
So, as a workaround, I constantly restart the GFS from time to time on 
node 1 and the inodes are de-allocated:

Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Scanning for 
log elements...
Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Found 92 
unlinked inodes
Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Found quota 
changes for 0 IDs
Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Done

- I tried various parameters using gfs_tool, however, the deleted inodes 
never get removed unless I restart the whole GFS on node 1 - is there any 
way to circumvent this issue ?
inoded_secs is at 15 and neither gfs_tool reclaim nor gfs_tool shrink show 
any inodes reclaimed.




From maciej.bogucki at artegence.com  Wed Jul 18 13:51:25 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Wed, 18 Jul 2007 15:51:25 +0200
Subject: [Linux-cluster] GFS inodes
In-Reply-To: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
References: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
Message-ID: <469E1ADD.9090309@artegence.com>

> - I tried various parameters using gfs_tool, however, the deleted inodes
> never get removed unless I restart the whole GFS on node 1 - is there
> any way to circumvent this issue ?
> inoded_secs is at 15 and neither gfs_tool reclaim nor gfs_tool shrink
> show any inodes reclaimed.

Hello,

Did you try "gfs_tool reclaim"? You could also read more here

http://archives.free.net.ph/message/20061220.150843.80921ed1.en.html

Best Regards
Maciej Bogucki



From stuarta at squashedfrog.net  Thu Jul 19 10:18:32 2007
From: stuarta at squashedfrog.net (Stuart Auchterlonie)
Date: Thu, 19 Jul 2007 11:18:32 +0100
Subject: [Linux-cluster] Communication between LVS nodes
In-Reply-To: <46408610.2080605@sans.org>
References: <46408610.2080605@sans.org>
Message-ID: <469F3A78.8020406@squashedfrog.net>

Firstly apologies for answering a 2 month old question.
answers below.

David Goldsmith wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Assume I have an LVS cluster setup with two LVS routers and 4 LVS member
> nodes.  Two of the nodes provide load balanced web servers.  Two of the
> nodes provide load-balanced proxy servers.
> 
> External customers connecting to the site can be passed to either of the
> two web server nodes.
> 
> Internal folks doing Internet browsing can be passed to either of the
> two proxy server nodes (assuming their web browser is configured to use
> the proxy).
> 
> Can the web servers in the LVS cluster use the LVS interface to the
> proxy servers rather than communicating directly to one of the two proxy
> server nodes?

Yes they can, you just have to abide by a few specific rules.
(I'm also assuming you are using DR)

1. The virtual ip address of the proxy service is different to
    the virtual ip address of the web service
2. The loopback adapter on each node only contains the virtual
    ip address of the service it's providing, either proxy or
    web, but not both. Otherwise the node will attempt to talk
    to itself.
3. The internal cluster network between the LVS directors and the
    LVS nodes is only used to forward packets from the directors
    to the real nodes, and is not used as the default route to the
    network.
4. The virtual ip addresses don't live in the same ip network as
    the realservers providing the service. ie.
    - 10.1.1.0/24 = virtual ip network
    - 10.1.2.0/24 = private cluster network
    - 10.1.3.0/25 = real server's externally routed network.

Essentially the whole arp problem is being avoided by putting
the virtual services on a different network, forcing connections
from the web servers (in this case) to be routed out to the
virtual addresses, rather than being allowed to "see" the real
servers and suffer the arp problem.

Maybe a picture will help

------------------------------------------ Virt Network (A)
              |               |
          Director        Director
              |               |
------------------------------------------ Priv Network (B)
    |           |           |          |
  RealWeb    RealWeb    RealProxy   RealProxy
    |           |           |          |
------------------------------------------ Real Server Net (C)

So any outbound connections from the RealWeb servers must
originate on (C) and due to being on different networks
get routed to (A) and then onto the RealProxy servers.


That's essentially what i have at the moment except that
the RealWeb servers are IIS nodes (ugh!) and network (C) is
split further into (C) & (D), with the different types of
realservers on different networks.


I am going to sit down and put this on a website one day
since it's in the "think it might work, but dunno" of every
document i read while setting it up.

Oh and as a footnote for the archives the windows boxes
can't bind to a /32 network on their loopback adapter,
so you will have to plan on /30 spacing of your virtual
ip addresses.


Regards,
Stuart Auchterlonie


> 
> If not, and the web server nodes are configured to connect to one
> specific proxy node, that would seem to create a possible failure point.
> 
> Thanks
> - --
> David Goldsmith



From tomas.hoger at gmail.com  Thu Jul 19 14:43:40 2007
From: tomas.hoger at gmail.com (Tomas Hoger)
Date: Thu, 19 Jul 2007 16:43:40 +0200
Subject: [Linux-cluster] doubts in Piranha
In-Reply-To: <20070711203622.GW18076@redhat.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A051EDB9B@inblr-exch1.eu.uis.unisys.com>
	<BAY105-F212B5A239D83A1E826C41AE3040@phx.gbl>
	<20070711143225.GO18076@redhat.com>
	<6cfbd1b40707110802y239fa1a0j870cdccbd61063ec@mail.gmail.com>
	<20070711203622.GW18076@redhat.com>
Message-ID: <6cfbd1b40707190743l4b46d254of4802184ce230535@mail.gmail.com>

On 7/11/07, Lon Hohberger <lhh at redhat.com> wrote:
> On Wed, Jul 11, 2007 at 05:02:43PM +0200, Tomas Hoger wrote:
[...]
> > Regarding documentation: It only mentions arptables as ARP problem
> > fix.  For kernel 2.6+, there is also another way using sysctl, which
> > does not require additional kernel patches / modules.
>
> There is an IPtables one in the docs too, which works on both 2.4.x and
> 2.6.x IIRC.

Yes Lon, you're right.  IPtables method using REDIRECT is documented
too.  I've overlooked it at first quick look.

However, I was referring to the following method:

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html#2_6_arp_announce

th.



From sebastian.walter at fu-berlin.de  Thu Jul 19 16:58:14 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Thu, 19 Jul 2007 18:58:14 +0200
Subject: [Linux-cluster] Hassle with clvmd over external network
Message-ID: <469F9826.70007@fu-berlin.de>

Dear list,

I'm trying to set up RHCS and GFS in a cluster which has two network
memberships, one is internal (eth0, 10.1.0.0/16, dns-names: host.local),
the other external (eth1, our real-world subnet and dns names). Every
cluster node has these two interfaces and related ip-numbers in both
networks.

When I set up the RHCS and GFS on the local subnet, everything works
fine (ccsd, cman, clvmd, ... and also gfs mountable volumes). But if I
try to change cluster.conf to use the real-world addresses (I want to
use the gfs volumes also outside of the cluster), clvmd always makes
problems. I followed the faq and changed /etc/init.d/cman to connect
with -n host.external.dns.com. All hosts are in /etc/hosts. ccsd starts
well on all clients, as does cman and fenced. But when I try to start
the clvmd service on all nodes simultaneously, I get errors (Starting
clvmd: clvmd startup timed out).

This is what my /proc gives me:
[root at dtm ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                          11   2 run       -
[8 7 6 5 4 3 2 9 10 11 1]

DLM Lock Space:  "clvmd"                            14   3 join     
S-6,20,11
[8 7 6 5 4 3 2 9 10 11 1]

[root at compute-0-1 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                          11   2 run       -
[2 3 4 5 6 7 9 10 8 11 1]

DLM Lock Space:  "clvmd"                            14   3 update    U-4,1,1
[2 3 4 5 6 7 8 9 10 11 1]

(the second output I get from all the other nodes, I think it depends on
which host i start the service first on)

Has anybody an idea how clvmd communicates to each other? cman is doing
fine... any other experiences? Thanks for any advice...

Regards,
Sebastian



From hal_bg at yahoo.com  Thu Jul 19 19:43:36 2007
From: hal_bg at yahoo.com (Hal)
Date: Thu, 19 Jul 2007 12:43:36 -0700 (PDT)
Subject: [Linux-cluster] GNBD and gfs - wrong FS type
In-Reply-To: <745589.7708.qm@web32201.mail.mud.yahoo.com>
Message-ID: <756426.98457.qm@web32213.mail.mud.yahoo.com>

Just for the record :)
problem solved! 

For some reason to use gfs1 on FC6 one has to install gfs2 tools.
and since gfs2 from cluster-2.00.00 sources does not compile, gfs2-utils
should be installed and miraculously from this very moment gfs starts 
working...
How can one guess that gfs requires gfs2 tools along with gfs1 tools, just 
like gfs.ko needs gfs2.ko?

Hal

--- Hal <hal_bg at yahoo.com> wrote:

> hallo 
> I have trouble mounting GNBD inported gfs on both nodes of my test clusuer.
> If
> the lock is set to "lock_nolock" it mounts fine but this is not what i want.
> When I use lock_dlm I get: 
> mount: wrong fs type, bad option, bad superblock on /dev/gnbd/global_disk,
>        missing codepage or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> 
> What I am doing wrong?
> Total output follows (Selinux is NOT in enforcing mode):
> 
> [root at node2 ~]# modprobe gnbd
> [root at node2 ~]# modprobe gfs2
> [root at node2 ~]# modprobe gfs
> [root at node2 ~]# modprobe lock_dlm
> [root at node2 ~]# gnbd_import -n -i 192.168.0.60
> gnbd_import: created directory /dev/gnbd
> gnbd_import: created gnbd device global_disk
> gnbd_recvd: gnbd_recvd started
> [root at node2 ~]# cd /etc/init.d/
> [root at node2 init.d]# ./cman start
> Starting cluster: 
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... done
>    Starting daemons... done
>    Starting fencing... done
>                                                            [  OK  ]
> [root at node2 ~]# gfs_mkfs -p lock_dlm -t testc:gfs1 -j6 /dev/gnbd/global_disk 
> This will destroy any data on /dev/gnbd/global_disk.
>   It appears to contain a gfs filesystem.
> 
> Are you sure you want to proceed? [y/n] y
> 
> Device:                    /dev/gnbd/global_disk
> Blocksize:                 4096
> Filesystem Size:           851880
> Journals:                  6
> Resource Groups:           14
> Locking Protocol:          lock_dlm
> Lock Table:                testc:gfs1
> 
> Syncing...
> All Done
> [root at node2 ~]# mount -t gfs /dev/gnbd/global_disk /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/gnbd/global_disk,
>        missing codepage or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> 
> [root at node2 ~]# dmesg |tail
> GFS: fsid=testc:gfs1.0: Scanning for log elements...
> GFS: fsid=testc:gfs1.0: Found 0 unlinked inodes
> GFS: fsid=testc:gfs1.0: Found quota changes for 0 IDs
> GFS: fsid=testc:gfs1.0: Done
> SELinux: initialized (dev gnbd0, type gfs), uses xattr
> audit(1184744195.259:4): avc:  denied  { getattr } for  pid=1848 comm="hald"
> name="global_disk" dev=tmpfs ino=19253 scontext=system_u:system_r:hald_t:s0
> tcontext=root:object_r:device_t:s0 tclass=blk_file
> Trying to join cluster "lock_dlm", "testc:gfs1"
> Joined cluster. Now mounting FS...
> GFS: fsid=testc:gfs1.4294967295: can't mount journal #4294967295
> GFS: fsid=testc:gfs1.4294967295: there are only 6 journals (0 - 5)
> [root at node2 ~]# 
> 
> 
>  
>
____________________________________________________________________________________
> Now that's room service!  Choose from over 150,000 hotels
> in 45,000 destinations on Yahoo! Travel to find your fit.
> http://farechase.yahoo.com/promo-generic-14795097
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



       
____________________________________________________________________________________
Moody friends. Drama queens. Your life? Nope! - their life, your story. Play Sims Stories at Yahoo! Games.
http://sims.yahoo.com/  



From pcaulfie at redhat.com  Fri Jul 20 07:09:08 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 20 Jul 2007 08:09:08 +0100
Subject: [Linux-cluster] Hassle with clvmd over external network
In-Reply-To: <469F9826.70007@fu-berlin.de>
References: <469F9826.70007@fu-berlin.de>
Message-ID: <46A05F94.9090405@redhat.com>

Sebastian Walter wrote:
> Dear list,
> 
> I'm trying to set up RHCS and GFS in a cluster which has two network
> memberships, one is internal (eth0, 10.1.0.0/16, dns-names: host.local),
> the other external (eth1, our real-world subnet and dns names). Every
> cluster node has these two interfaces and related ip-numbers in both
> networks.
> 
> When I set up the RHCS and GFS on the local subnet, everything works
> fine (ccsd, cman, clvmd, ... and also gfs mountable volumes). But if I
> try to change cluster.conf to use the real-world addresses (I want to
> use the gfs volumes also outside of the cluster), clvmd always makes
> problems. I followed the faq and changed /etc/init.d/cman to connect
> with -n host.external.dns.com. All hosts are in /etc/hosts. ccsd starts
> well on all clients, as does cman and fenced. But when I try to start
> the clvmd service on all nodes simultaneously, I get errors (Starting
> clvmd: clvmd startup timed out).
> 
> This is what my /proc gives me:
> [root at dtm ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                          11   2 run       -
> [8 7 6 5 4 3 2 9 10 11 1]
> 
> DLM Lock Space:  "clvmd"                            14   3 join     
> S-6,20,11
> [8 7 6 5 4 3 2 9 10 11 1]
> 
> [root at compute-0-1 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                          11   2 run       -
> [2 3 4 5 6 7 9 10 8 11 1]
> 
> DLM Lock Space:  "clvmd"                            14   3 update    U-4,1,1
> [2 3 4 5 6 7 8 9 10 11 1]
> 
> (the second output I get from all the other nodes, I think it depends on
> which host i start the service first on)
> 
> Has anybody an idea how clvmd communicates to each other? cman is doing
> fine... any other experiences? Thanks for any advice...
> 


It'll be waiting for the DLM lockspace creation to complete on all nodes. Have a
look in syslog for DLM messages.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From hal_bg at yahoo.com  Fri Jul 20 08:39:13 2007
From: hal_bg at yahoo.com (Hal)
Date: Fri, 20 Jul 2007 01:39:13 -0700 (PDT)
Subject: [Linux-cluster] GFS locks wnen a node fails
Message-ID: <224214.96819.qm@web32204.mail.mud.yahoo.com>

Hallo everybody,
I have a test cluster of 4 machines. node0 - gnbd server and gnbd fence server
and 3 nodes to mount gfs. The problem is that when I unplug one of the nodes,
gfs locks and no one can access it until the node is reconnected. 

How can this lock be avoided if one node fails? 
How can I tell that gnbd-fencing is working at all?

"gnbd_import -c node0" says nothing even if I do "fence_node node2" I assume 
fencing is not working am I right?

Regards 
Hal


       
____________________________________________________________________________________Ready for the edge of your seat? 
Check out tonight's top picks on Yahoo! TV. 
http://tv.yahoo.com/



From maciej.bogucki at artegence.com  Fri Jul 20 10:01:16 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 20 Jul 2007 12:01:16 +0200
Subject: [Linux-cluster] GFS locks wnen a node fails
In-Reply-To: <224214.96819.qm@web32204.mail.mud.yahoo.com>
References: <224214.96819.qm@web32204.mail.mud.yahoo.com>
Message-ID: <46A087EC.7050509@artegence.com>

Hal napisa?(a):
> Hallo everybody,
> I have a test cluster of 4 machines. node0 - gnbd server and gnbd fence server
> and 3 nodes to mount gfs. The problem is that when I unplug one of the nodes,
> gfs locks and no one can access it until the node is reconnected. 
> 
> How can this lock be avoided if one node fails? 
> How can I tell that gnbd-fencing is working at all?
> 
> "gnbd_import -c node0" says nothing even if I do "fence_node node2" I assume 
> fencing is not working am I right?

Hello,

It looks like You don't hava fencing properly configured.
You have to check Your logs, to see what is going on.
If Your fence agent failed, GFS filesystem will be freezed(no ro/rw
operations permited) until You perform manual fencing.

Best Regards
Maciej Bogucki



From sebastian.walter at fu-berlin.de  Fri Jul 20 10:59:12 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Fri, 20 Jul 2007 12:59:12 +0200
Subject: [Linux-cluster] Hassle with clvmd over external network
In-Reply-To: <46A05F94.9090405@redhat.com>
References: <469F9826.70007@fu-berlin.de> <46A05F94.9090405@redhat.com>
Message-ID: <46A09580.8060600@fu-berlin.de>

Hello Patrick, thanks for your comment.

I can't see any messages regarding dlm in the syslog. When I call
`strace clvmd -d`, the first two lines are as followed:
[root at compute-0-1 ~]# strace clvmd -d
execve("/usr/sbin/clvmd", ["clvmd", "-d"], [/* 34 vars */]) = 0
uname({sys="Linux", node="compute-0-1.local", ...}) = 0
[...]

So, obviously, clvmd doesn't use the hostnames from cluster.conf (where
it should be c0-1.external.dns.com), but the ones provided by uname.
Does anybody know how to change this behaviour? Any help is greatly
appreciated!

Regards,
Sebastian


Patrick Caulfield wrote:
> Sebastian Walter wrote:
>   
>> Dear list,
>>
>> I'm trying to set up RHCS and GFS in a cluster which has two network
>> memberships, one is internal (eth0, 10.1.0.0/16, dns-names: host.local),
>> the other external (eth1, our real-world subnet and dns names). Every
>> cluster node has these two interfaces and related ip-numbers in both
>> networks.
>>
>> When I set up the RHCS and GFS on the local subnet, everything works
>> fine (ccsd, cman, clvmd, ... and also gfs mountable volumes). But if I
>> try to change cluster.conf to use the real-world addresses (I want to
>> use the gfs volumes also outside of the cluster), clvmd always makes
>> problems. I followed the faq and changed /etc/init.d/cman to connect
>> with -n host.external.dns.com. All hosts are in /etc/hosts. ccsd starts
>> well on all clients, as does cman and fenced. But when I try to start
>> the clvmd service on all nodes simultaneously, I get errors (Starting
>> clvmd: clvmd startup timed out).
>>
>> This is what my /proc gives me:
>> [root at dtm ~]# cat /proc/cluster/services
>> Service          Name                              GID LID State     Code
>> Fence Domain:    "default"                          11   2 run       -
>> [8 7 6 5 4 3 2 9 10 11 1]
>>
>> DLM Lock Space:  "clvmd"                            14   3 join     
>> S-6,20,11
>> [8 7 6 5 4 3 2 9 10 11 1]
>>
>> [root at compute-0-1 ~]# cat /proc/cluster/services
>> Service          Name                              GID LID State     Code
>> Fence Domain:    "default"                          11   2 run       -
>> [2 3 4 5 6 7 9 10 8 11 1]
>>
>> DLM Lock Space:  "clvmd"                            14   3 update    U-4,1,1
>> [2 3 4 5 6 7 8 9 10 11 1]
>>
>> (the second output I get from all the other nodes, I think it depends on
>> which host i start the service first on)
>>
>> Has anybody an idea how clvmd communicates to each other? cman is doing
>> fine... any other experiences? Thanks for any advice...
>>
>>     
>
>
> It'll be waiting for the DLM lockspace creation to complete on all nodes. Have a
> look in syslog for DLM messages.
>
>   



From pillai at mathstat.dal.ca  Fri Jul 20 11:01:25 2007
From: pillai at mathstat.dal.ca (Balagopal Pillai)
Date: Fri, 20 Jul 2007 08:01:25 -0300 (ADT)
Subject: [Linux-cluster] GFS locks wnen a node fails
In-Reply-To: <46A087EC.7050509@artegence.com>
References: <224214.96819.qm@web32204.mail.mud.yahoo.com>
	<46A087EC.7050509@artegence.com>
Message-ID: <Pine.LNX.4.64.0707200758550.28490@chase.mathstat.dal.ca>

On Fri, 20 Jul 2007, Maciej Bogucki wrote:
Hi,

      I saw this problem when i was looking at gfs as an option for an 
hpc cluster. Did you mount with oopses_ok? With that option, if one node 
has a problem, gfs mounts on other nodes too tend to hang needing a total 
cluster restart. Ideally, the nodes should panic when they have gfs 
problems and the other nodes would stay unaffected.

Regards
Balagopal


> Hal napisa?(a):
> > Hallo everybody,
> > I have a test cluster of 4 machines. node0 - gnbd server and gnbd fence server
> > and 3 nodes to mount gfs. The problem is that when I unplug one of the nodes,
> > gfs locks and no one can access it until the node is reconnected. 
> > 
> > How can this lock be avoided if one node fails? 
> > How can I tell that gnbd-fencing is working at all?
> > 
> > "gnbd_import -c node0" says nothing even if I do "fence_node node2" I assume 
> > fencing is not working am I right?
> 
> Hello,
> 
> It looks like You don't hava fencing properly configured.
> You have to check Your logs, to see what is going on.
> If Your fence agent failed, GFS filesystem will be freezed(no ro/rw
> operations permited) until You perform manual fencing.
> 
> Best Regards
> Maciej Bogucki
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From maciej.bogucki at artegence.com  Fri Jul 20 11:13:12 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 20 Jul 2007 13:13:12 +0200
Subject: [Linux-cluster] Hassle with clvmd over external network
In-Reply-To: <46A09580.8060600@fu-berlin.de>
References: <469F9826.70007@fu-berlin.de> <46A05F94.9090405@redhat.com>
	<46A09580.8060600@fu-berlin.de>
Message-ID: <46A098C8.306@artegence.com>

Sebastian Walter napisa?(a):
> Hello Patrick, thanks for your comment.
> 
> I can't see any messages regarding dlm in the syslog. When I call
> `strace clvmd -d`, the first two lines are as followed:
> [root at compute-0-1 ~]# strace clvmd -d
> execve("/usr/sbin/clvmd", ["clvmd", "-d"], [/* 34 vars */]) = 0
> uname({sys="Linux", node="compute-0-1.local", ...}) = 0
> [...]
> 
> So, obviously, clvmd doesn't use the hostnames from cluster.conf (where
> it should be c0-1.external.dns.com), but the ones provided by uname.
> Does anybody know how to change this behaviour? Any help is greatly
> appreciated!

Hello,

Add

1.1.1.1 c0-1.external.dns.com

to /etc/hosts, where 1.1.1.1 is IP address of Your external network, and
it should help.

Best Regards
Maciej Bogucki



From Dan.Askew at jmsmucker.com  Fri Jul 20 12:00:44 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Fri, 20 Jul 2007 08:00:44 -0400
Subject: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <Pine.LNX.4.64.0707200758550.28490@chase.mathstat.dal.ca>
Message-ID: <OF345DD442.1BED8381-ON8525731E.0040CD73-8525731E.004270B7@jmsmucker.com>

Greetings all,

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux 
Cluster working and need some advise.

I have read the NFS Cookbook from Redhat and have a the following working

2 Node Linux Cluster  (RHEL AS 4.0 update 5)

one test disk LVM formated ext3 (have not decided onGFS or not)

Use Vitual IPAddress to access the disks via NFS

When SYSTEMA  runs the service and the client machine access the disk and 
makes changes. Then I fail over to SYSTEMB these changes made by the 
client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems
I have made the following changes to LVM.CONF

(I have tried locking_type = 2 and locking_type = 3) both have he same 
results. (as above)

Sorry for my ignorance but can anyone tell me what I am doing 
wrong...would GFS solve the syncing problem? 





Dan Askew
Sr Systems Administrator
The J. M. Smucker Company

Phone (330)684-3662

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/ee8b7dd3/attachment.htm>

From kristoffer.lippert at jppol.dk  Fri Jul 20 12:13:59 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 20 Jul 2007 14:13:59 +0200
Subject: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <OF345DD442.1BED8381-ON8525731E.0040CD73-8525731E.004270B7@jmsmucker.com>
References: <Pine.LNX.4.64.0707200758550.28490@chase.mathstat.dal.ca>
	<OF345DD442.1BED8381-ON8525731E.0040CD73-8525731E.004270B7@jmsmucker.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F1@exchsrv07.rootdom.dk>

Hi,
 
You need gfs for the changes to appear on both servers. With GFS, when one server changes a file, the other server is made aware of the changes. Also GFS takes care of file locking. Also you need a fencedevice, so the cluster can shutdown a "defective" server, and make sure it dosn't corrupt the GFS. 
 
For your current setup:
When you have both servers running, you could mount the ext3 fs on both servers, but only the server that writes a file, will be aware of it. The other server will be aware of the new file when you remount the fs. 
 
Hope this helps a bit.
 
Kind Regards
Kristoffer
 
 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe



Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and makes changes. Then I fail over to SYSTEMB these changes made by the client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing wrong...would GFS solve the syncing problem? 





Dan Askew
Sr Systems Administrator
The J. M. Smucker Company

Phone (330)684-3662


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/4540de96/attachment.htm>

From sebastian.walter at fu-berlin.de  Fri Jul 20 12:14:41 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Fri, 20 Jul 2007 14:14:41 +0200
Subject: [Linux-cluster] Hassle with clvmd over external network
In-Reply-To: <46A098C8.306@artegence.com>
References: <469F9826.70007@fu-berlin.de> <46A05F94.9090405@redhat.com>
	<46A09580.8060600@fu-berlin.de> <46A098C8.306@artegence.com>
Message-ID: <46A0A731.9060504@fu-berlin.de>

Thanks, Maciej.

This is already done, without it, also ccsd and cman wouldn't run...
cman is started as:
cman -n c0-1.external.dns.com

Regards,
Sebastian

Maciej Bogucki wrote:
> Sebastian Walter napisa?(a):
>   
>> Hello Patrick, thanks for your comment.
>>
>> I can't see any messages regarding dlm in the syslog. When I call
>> `strace clvmd -d`, the first two lines are as followed:
>> [root at compute-0-1 ~]# strace clvmd -d
>> execve("/usr/sbin/clvmd", ["clvmd", "-d"], [/* 34 vars */]) = 0
>> uname({sys="Linux", node="compute-0-1.local", ...}) = 0
>> [...]
>>
>> So, obviously, clvmd doesn't use the hostnames from cluster.conf (where
>> it should be c0-1.external.dns.com), but the ones provided by uname.
>> Does anybody know how to change this behaviour? Any help is greatly
>> appreciated!
>>     
>
> Hello,
>
> Add
>
> 1.1.1.1 c0-1.external.dns.com
>
> to /etc/hosts, where 1.1.1.1 is IP address of Your external network, and
> it should help.
>
> Best Regards
> Maciej Bogucki
>   



From Dan.Askew at jmsmucker.com  Fri Jul 20 12:26:12 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Fri, 20 Jul 2007 08:26:12 -0400
Subject: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F1@exchsrv07.rootdom.dk>
Message-ID: <OF80385CA9.A66C2A9E-ON8525731E.00442E2F-8525731E.0044C56F@jmsmucker.com>

Could you elaborate on the fence device.  What would suggest using ?






"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com
07/20/2007 08:13 AM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com>
cc

Subject
SV: [Linux-cluster] Linux Clustering Newbe






Hi,
 
You need gfs for the changes to appear on both servers. With GFS, when one 
server changes a file, the other server is made aware of the changes. Also 
GFS takes care of file locking. Also you need a fencedevice, so the 
cluster can shutdown a "defective" server, and make sure it dosn't corrupt 
the GFS. 
 
For your current setup:
When you have both servers running, you could mount the ext3 fs on both 
servers, but only the server that writes a file, will be aware of it. The 
other server will be aware of the new file when you remount the fs. 
 
Hope this helps a bit.
 
Kind Regards
Kristoffer
 
 

Fra: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] P? vegne af 
Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux 
Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and 
makes changes. Then I fail over to SYSTEMB these changes made by the 
client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same 
results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing 
wrong...would GFS solve the syncing problem? 




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/dea6feca/attachment.htm>

From kristoffer.lippert at jppol.dk  Fri Jul 20 12:36:17 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 20 Jul 2007 14:36:17 +0200
Subject: SV: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <OF80385CA9.A66C2A9E-ON8525731E.00442E2F-8525731E.0044C56F@jmsmucker.com>
References: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F1@exchsrv07.rootdom.dk>
	<OF80385CA9.A66C2A9E-ON8525731E.00442E2F-8525731E.0044C56F@jmsmucker.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>

Hi,
 
A fence device is a device that can "build a fence" around a node, and thus keep it from corrupting a shared filesystem.
Most commonly i think a powerswitch is used. It simply cuts the power to the decfunct server.
It looks and works like this:
http://www.wti.com/guides/rpb115ug.htm
(but there are ofcourse loads of brands avaliabel - No, I don't work for WTI ;-)
 
Alternatively there is the option of "cutting" the network from the server (that would be the network between server and disks), using a switch. I have not tried that method of fencing, so someone else might be able to fill in. 

Depending on the topology of your cluster (are your disks connected thrugh a dedicated fiber net or for instance iscsi.) the need to "cut" the defective server off from the disks will be different.
 
So in short, A fence device can be many things. :-)
 
Hope it make sence
/Kristoffer
 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:26
Til: linux clustering
Emne: Re: SV: [Linux-cluster] Linux Clustering Newbe



Could you elaborate on the fence device.  What would suggest using ? 






"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 

07/20/2007 08:13 AM 
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com> 
cc
Subject
SV: [Linux-cluster] Linux Clustering Newbe

	




Hi, 
  
You need gfs for the changes to appear on both servers. With GFS, when one server changes a file, the other server is made aware of the changes. Also GFS takes care of file locking. Also you need a fencedevice, so the cluster can shutdown a "defective" server, and make sure it dosn't corrupt the GFS. 
  
For your current setup: 
When you have both servers running, you could mount the ext3 fs on both servers, but only the server that writes a file, will be aware of it. The other server will be aware of the new file when you remount the fs. 
  
Hope this helps a bit. 
  
Kind Regards 
Kristoffer 
  
  


________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and makes changes. Then I fail over to SYSTEMB these changes made by the client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing wrong...would GFS solve the syncing problem? 




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/5e788b80/attachment.htm>

From Dan.Askew at jmsmucker.com  Fri Jul 20 12:41:17 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Fri, 20 Jul 2007 08:41:17 -0400
Subject: SV: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
Message-ID: <OF331CFDE1.E4E9EFDA-ON8525731E.00458381-8525731E.004626D3@jmsmucker.com>

Kristoffer,

Thanks for all the help. One last question....

Instead of GFS what do you think of OCFS2 from oracle...it is opensource 
It appears to accomplish the same funtion as GFS (I am still reading the 
material).







"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com
07/20/2007 08:36 AM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com>
cc

Subject
SV: SV: [Linux-cluster] Linux Clustering Newbe






Hi,
 
A fence device is a device that can "build a fence" around a node, and 
thus keep it from corrupting a shared filesystem.
Most commonly i think a powerswitch is used. It simply cuts the power to 
the decfunct server.
It looks and works like this:
http://www.wti.com/guides/rpb115ug.htm
(but there are ofcourse loads of brands avaliabel - No, I don't work for 
WTI ;-)
 
Alternatively there is the option of "cutting" the network from the server 
(that would be the network between server and disks), using a switch. I 
have not tried that method of fencing, so someone else might be able to 
fill in. 

Depending on the topology of your cluster (are your disks connected thrugh 
a dedicated fiber net or for instance iscsi.) the need to "cut" the 
defective server off from the disks will be different.
 
So in short, A fence device can be many things. :-)
 
Hope it make sence
/Kristoffer
 

Fra: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] P? vegne af 
Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:26
Til: linux clustering
Emne: Re: SV: [Linux-cluster] Linux Clustering Newbe


Could you elaborate on the fence device.  What would suggest using ? 





"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 
07/20/2007 08:13 AM 

Please respond to
linux clustering <linux-cluster at redhat.com>



To
"linux clustering" <linux-cluster at redhat.com> 
cc

Subject
SV: [Linux-cluster] Linux Clustering Newbe








Hi, 
  
You need gfs for the changes to appear on both servers. With GFS, when one 
server changes a file, the other server is made aware of the changes. Also 
GFS takes care of file locking. Also you need a fencedevice, so the 
cluster can shutdown a "defective" server, and make sure it dosn't corrupt 
the GFS. 
  
For your current setup: 
When you have both servers running, you could mount the ext3 fs on both 
servers, but only the server that writes a file, will be aware of it. The 
other server will be aware of the new file when you remount the fs. 
  
Hope this helps a bit. 
  
Kind Regards 
Kristoffer 
 
 

Fra: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] P? vegne af 
Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux 
Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and 
makes changes. Then I fail over to SYSTEMB these changes made by the 
client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same 
results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing 
wrong...would GFS solve the syncing problem? 




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster --
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/edb24910/attachment.htm>

From kristoffer.lippert at jppol.dk  Fri Jul 20 12:47:14 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 20 Jul 2007 14:47:14 +0200
Subject: SV: SV: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <OF331CFDE1.E4E9EFDA-ON8525731E.00458381-8525731E.004626D3@jmsmucker.com>
References: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
	<OF331CFDE1.E4E9EFDA-ON8525731E.00458381-8525731E.004626D3@jmsmucker.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F4@exchsrv07.rootdom.dk>

hi,
 
yes. I considered it myself when setting up my cluster. I seems ok, but nobody seems to use it. (Am i wrong in this perception?) 
That and the fact that there are lots of gfs installations made me go for GFS. I tried GFS2 first, but I couldn't make it stable without rebuilding kernels, wich makes updates and upgrades tedius so i stuck with GFS.
 
/Kristoffer
 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:41
Til: linux clustering
Emne: Re: SV: SV: [Linux-cluster] Linux Clustering Newbe



Kristoffer, 

Thanks for all the help. One last question.... 

Instead of GFS what do you think of OCFS2 from oracle...it is opensource It appears to accomplish the same funtion as GFS (I am still reading the material). 







"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 

07/20/2007 08:36 AM 
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com> 
cc
Subject
SV: SV: [Linux-cluster] Linux Clustering Newbe

	




Hi, 
  
A fence device is a device that can "build a fence" around a node, and thus keep it from corrupting a shared filesystem. 
Most commonly i think a powerswitch is used. It simply cuts the power to the decfunct server. 
It looks and works like this: 
http://www.wti.com/guides/rpb115ug.htm <http://www.wti.com/guides/rpb115ug.htm>  
(but there are ofcourse loads of brands avaliabel - No, I don't work for WTI ;-) 
  
Alternatively there is the option of "cutting" the network from the server (that would be the network between server and disks), using a switch. I have not tried that method of fencing, so someone else might be able to fill in. 

Depending on the topology of your cluster (are your disks connected thrugh a dedicated fiber net or for instance iscsi.) the need to "cut" the defective server off from the disks will be different. 
  
So in short, A fence device can be many things. :-) 
  
Hope it make sence 
/Kristoffer 
  


________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:26
Til: linux clustering
Emne: Re: SV: [Linux-cluster] Linux Clustering Newbe


Could you elaborate on the fence device.  What would suggest using ? 





"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 

07/20/2007 08:13 AM 

Please respond to
linux clustering <linux-cluster at redhat.com>



To
"linux clustering" <linux-cluster at redhat.com> 
cc
Subject
SV: [Linux-cluster] Linux Clustering Newbe


	





Hi, 
 
You need gfs for the changes to appear on both servers. With GFS, when one server changes a file, the other server is made aware of the changes. Also GFS takes care of file locking. Also you need a fencedevice, so the cluster can shutdown a "defective" server, and make sure it dosn't corrupt the GFS. 
 
For your current setup: 
When you have both servers running, you could mount the ext3 fs on both servers, but only the server that writes a file, will be aware of it. The other server will be aware of the new file when you remount the fs. 
 
Hope this helps a bit. 
 
Kind Regards 
Kristoffer 
 
 


________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and makes changes. Then I fail over to SYSTEMB these changes made by the client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing wrong...would GFS solve the syncing problem? 




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster --
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/fb8b1399/attachment.htm>

From Sthistle at gov.nl.ca  Fri Jul 20 12:56:12 2007
From: Sthistle at gov.nl.ca (Thistle, Scott)
Date: Fri, 20 Jul 2007 10:26:12 -0230
Subject: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F1@exchsrv07.rootdom.dk><OF80385CA9.A66C2A9E-ON8525731E.00442E2F-8525731E.0044C56F@jmsmucker.com>
	<00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
Message-ID: <506B469CC6211B49BE28F6AC56BBCA35013BF882@STJH-P102.PSNL.CA>

Another fence device is the IBM Bladecenter management console. I have two RHEL5 clusters running on BladeCenter, and this method of fencing will connect to the BC management console and reboot the other node via telnet. 

________________________________

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kristoffer Lippert
Sent: Friday, July 20, 2007 10:06 AM
To: linux clustering
Subject: SV: SV: [Linux-cluster] Linux Clustering Newbe


Hi,
 
A fence device is a device that can "build a fence" around a node, and thus keep it from corrupting a shared filesystem.
Most commonly i think a powerswitch is used. It simply cuts the power to the decfunct server.
It looks and works like this:
http://www.wti.com/guides/rpb115ug.htm
(but there are ofcourse loads of brands avaliabel - No, I don't work for WTI ;-)
 
Alternatively there is the option of "cutting" the network from the server (that would be the network between server and disks), using a switch. I have not tried that method of fencing, so someone else might be able to fill in. 

Depending on the topology of your cluster (are your disks connected thrugh a dedicated fiber net or for instance iscsi.) the need to "cut" the defective server off from the disks will be different.
 
So in short, A fence device can be many things. :-)
 
Hope it make sence
/Kristoffer
 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:26
Til: linux clustering
Emne: Re: SV: [Linux-cluster] Linux Clustering Newbe



Could you elaborate on the fence device.  What would suggest using ? 






"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 

07/20/2007 08:13 AM 
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com> 
cc
Subject
SV: [Linux-cluster] Linux Clustering Newbe	

		




Hi, 
  
You need gfs for the changes to appear on both servers. With GFS, when one server changes a file, the other server is made aware of the changes. Also GFS takes care of file locking. Also you need a fencedevice, so the cluster can shutdown a "defective" server, and make sure it dosn't corrupt the GFS. 
  
For your current setup: 
When you have both servers running, you could mount the ext3 fs on both servers, but only the server that writes a file, will be aware of it. The other server will be aware of the new file when you remount the fs. 
  
Hope this helps a bit. 
  
Kind Regards 
Kristoffer 
  
  


________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 14:01
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Greetings all, 

I am an old veteran of HP-UX Service Guard. I am trying to get a NFS linux Cluster working and need some advise. 

I have read the NFS Cookbook from Redhat and have a the following working 

2 Node Linux Cluster  (RHEL AS 4.0 update 5) 

one test disk LVM formated ext3 (have not decided onGFS or not) 

Use Vitual IPAddress to access the disks via NFS 

When SYSTEMA  runs the service and the client machine access the disk and makes changes. Then I fail over to SYSTEMB these changes made by the client are not present. 

I am runing CLVMD deamon 
The LVM disks are mounted on both systems 
I have made the following changes to LVM.CONF 

(I have tried locking_type = 2 and locking_type = 3) both have he same results. (as above) 

Sorry for my ignorance but can anyone tell me what I am doing wrong...would GFS solve the syncing problem? 




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/176c93ea/attachment.htm>

From Dan.Askew at jmsmucker.com  Fri Jul 20 14:19:03 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Fri, 20 Jul 2007 10:19:03 -0400
Subject: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <506B469CC6211B49BE28F6AC56BBCA35013BF882@STJH-P102.PSNL.CA>
Message-ID: <OFC8170BCF.A1271744-ON8525731E.004CDBE8-8525731E.004F1A36@jmsmucker.com>

Ok Here is what I did to get it working:

used RHCS to manage the cluster services 
used CLVM  to manage the LVM metadata across cluster
        Coded a "locking_type = 3" in the LVM.CONF file

Used OCFS2 to cluster and manage the filesystems across cluster.

Configuration is 

2 node cluster runing nfs

vitural IPAddr service to allow clients to mount the drives via NFS 

All appears to be working fine. I can fail over the services manually or 
even boot a server and the failover works great.

Now the big question....is this supported by redhat and oracle ...hmmmm


Thanks to all who helped get this working...


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/33826833/attachment.htm>

From teigland at redhat.com  Fri Jul 20 16:16:20 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 11:16:20 -0500
Subject: [Linux-cluster] Backups with GFS and RAM sizing
In-Reply-To: <4694BBEA020000C800002C86@gwia.health-first.org>
References: <4694BBEA020000C800002C86@gwia.health-first.org>
Message-ID: <20070720161620.GA4329@redhat.com>

On Wed, Jul 11, 2007 at 11:15:54AM -0400, Danny Wall wrote:
> My backups are taking a very long time, and I am looking for ways to
> optimize the process, so suggestions are welcome.

I suspect that a large part of this is just the inherent slowness of gfs.
Here are some ideas to try,
1. echo "0" > /proc/cluster/lock_dlm/drop_count
2. mount each fs with nolock during the backup (mount -o lockproto=lock_nolock)
3. unmount/remount each fs after the backup is done with it to get rid
   of the excessive unwanted caching effects from scanning the whole fs
4. use the glock_purge feature that's in 4.5 (which approximates 3 without
   the unmount)

Dave



From teigland at redhat.com  Fri Jul 20 16:28:30 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 11:28:30 -0500
Subject: [Linux-cluster] strange slowness of ls with 1 newly created file
	on gfs 1 or 2
In-Reply-To: <46941A71.7010602@blackhole.sk>
References: <46941A71.7010602@blackhole.sk>
Message-ID: <20070720162830.GB4329@redhat.com>

On Wed, Jul 11, 2007 at 01:46:57AM +0200, Pavel Stano wrote:
> Hello,
> 
> i am testing gfs and its very slow, please look at this if it is normal
> or i miss something
> 
> i have 2 node cluster, nodes are connected via SAS to disk array promise
> e310s, when i run dd on attached block device i have cca 150MBps
> throughput on both nodes
> there is debian etch, i compile cluster-2.00.00 with gfs1 module
> i create one 475GB logical volume (i dont use clvmd, just normal lvm),
> create gfs1 on it
> gfs_mkfs -t cluster1:data0 -p lock_dlm -j 2 /dev/vgdata0/lvdata0
> mount that lv on both nodes to directory /d/0/, run df on both nodes
> and then run touch on node 1:
> serpico# touch /d/0/test
> 
> and ls on node 2:
> dinorscio:~# time ls /d/0/
> test
> 
> real    0m9.486s
> user    0m0.000s
> sys     0m0.004s
> 
> it took almost 10 seconds to display 1 file on that filesystem
> when i again create other file via touch(node1) and run ls (node2) it
> took again cca 10 seconds
> i monitor activity with dstat and there is 50% iowait on node where run
> ls (50% because 2 core cpu on node), but no disk activity
> nodes are connected via 1gbps idle ethernet
> 
> and when ls is runing, i look at wchan with ps
> ps axf -o pid,wchan:20,cmd|grep ls
> 6387 sync_buffer                               \_ ls --color=auto /d/0/
> i run ps many times there is still sync_buffer, i dont see other kernel
> function

Have you found the problem here?  10 seconds definately means something
has gone very wrong somewhere.  I don't think it's in the dlm or gfs,
though.  My guess is that the problem is in the i/o layer (sync_buffer is
waiting for a write to complete).

Could you partition your disk, create an ext3 fs on each partition,
have each node mount one of the ext3's, and run some tests?

Dave



From teigland at redhat.com  Fri Jul 20 16:41:25 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 11:41:25 -0500
Subject: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter
	machines: 1st machine is remove physically. The remaining one
	does not became Active (waiting for fenced)
In-Reply-To: <1184248234.4533.6.camel@xw9300.bidmc.harvard.edu>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
	<1184248234.4533.6.camel@xw9300.bidmc.harvard.edu>
Message-ID: <20070720164124.GC4329@redhat.com>

On Thu, Jul 12, 2007 at 09:50:34AM -0400, Robert Hurst wrote:
> On a related note, what is the correct value for clean_start in
> cluster.conf?
> 
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>

clean_start should always be 0 (the default).

Dave



From teigland at redhat.com  Fri Jul 20 16:53:52 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 11:53:52 -0500
Subject: [Linux-cluster] fenced segfault
In-Reply-To: <20070716143713.2AB9F47C8F@l00-bugdead-prods.de>
References: <20070716143713.2AB9F47C8F@l00-bugdead-prods.de>
Message-ID: <20070720165352.GD4329@redhat.com>

On Mon, Jul 16, 2007 at 04:37:12PM +0200, Sebastian Reitenbach wrote:
> <clusternode name="srv4" nodeid="1" votes="1">
> <clusternode name="srv5" nodeid="1" votes="1">
> any hint what could cause the segfault of the fenced?

Try different nodeids.

Dave



From teigland at redhat.com  Fri Jul 20 17:00:49 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 12:00:49 -0500
Subject: [Linux-cluster] mount: not in default fence domain
In-Reply-To: <20070717143625.BC44A47EDC@l00-bugdead-prods.de>
References: <20070717143625.BC44A47EDC@l00-bugdead-prods.de>
Message-ID: <20070720170049.GE4329@redhat.com>

On Tue, Jul 17, 2007 at 04:36:25PM +0200, Sebastian Reitenbach wrote:
> Hi,
> 
> I am still on my openSUSE 10.2, x86_64, cluster-2.0.0. and openais 0.80.1 
> and kernel version 2.6.20.15.
> 
> I got so far, that my two node cluster tries to start when i issue 
> a /etc/init.d/cman start. When I only start one node, the fenced crashes. 
> But when I start both at the same time, the fenced stays alive, as mentioned 
> in an earlier mail in another thread. After about half an hour, I thought I 
> try mounting the GFS2 partition. The mount process hung on both machines 
> too. Then I killed the cman init script on one machine. The other detected 
> this, but then the mount commands ended too, with the following error 
> message:
> 
> Jul 17 15:28:06 srv5 gfs_controld[22012]: mount: not in default fence domain
> Jul 17 15:28:35 srv5 gfs_controld[22012]: mount: not in default fence domain
> 
> the fenced was running, but no fence_ilo, but I am also not sure, whether 
> this one has to run or not.
> 
> what does the error message above is trying to tell me? Can I fix this, or 
> will it "fix itself", when I am getting the cman script starting correctly?

Try again after you've fixed the nodeids.  If it's a problem, try starting
things up manually (without the init script), and use group_tool to verify
that the nodes have both properly joined the fence domain after
fence_tool join, and prior to the mount. (Follow usage.txt for manually
startup.)

Dave



From teigland at redhat.com  Fri Jul 20 17:27:53 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 12:27:53 -0500
Subject: [Linux-cluster] GFS inodes
In-Reply-To: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
References: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
Message-ID: <20070720172753.GF4329@redhat.com>

On Wed, Jul 18, 2007 at 02:58:14PM +0200, Henke wrote:
> 
> Hi,
> 
> I have a problem using GFS 6.1 in a 4-node cluster.
> My scenario is as follows:
> 4 nodes share a 100GB SAN device.
> node 1 generates data, while nodes 2-4 only read that data (although the 
> gfs is mounted rw).
> The amount of shared data is ~ 3GB.
> node 1 creates new data and moves the old data. After the mv, the old 
> directory is removed.
> nodes 2-4 notice that the data has changed and restart.
> 
> However, I still wonder why the deleted inodes are never de-allocated.
> Without manual intervention (eg. "service gfs restart") on one node 1,
> the filesystem grows about 3GB/day although the actual data is still only
> ~3GB.
> So, as a workaround, I constantly restart the GFS from time to time on 
> node 1 and the inodes are de-allocated:
> 
> Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Scanning for 
> log elements...
> Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Found 92 
> unlinked inodes
> Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Found quota 
> changes for 0 IDs
> Jul 18 14:04:07 node1 kernel: GFS: fsid=cluster:lucene.0: Done
> 
> - I tried various parameters using gfs_tool, however, the deleted inodes 
> never get removed unless I restart the whole GFS on node 1 - is there any 
> way to circumvent this issue ?
> inoded_secs is at 15 and neither gfs_tool reclaim nor gfs_tool shrink show 
> any inodes reclaimed.

This is strange, after being unlinked, gfs_inoded should deallocate them,
but apparently it's not.  What release are you using?  Could you record
the inode numbers of the files before they're unlinked next time (ls -li)?
Then, after they're unlinked, do 'gfs_tool counters' (looking for unlinked
count), and 'gfs_tool lockdump' and send this info?  I'd like to see what
state the iopen locks are in for the unlinked files.  Also send the output
of 'gfs_tool gettune'.
Thanks,
Dave



From Danny.Wall at health-first.org  Fri Jul 20 17:32:40 2007
From: Danny.Wall at health-first.org (Danny Wall)
Date: Fri, 20 Jul 2007 13:32:40 -0400
Subject: [Linux-cluster] Backups with GFS and RAM sizing
Message-ID: <46A0B978020000C8000030A1@gwia.health-first.org>

Thanks for the information. I am a little hesitant about mounting with
the no_lock option during backups, since the filesystem will be mounted
on multiple nodes. Currently, only one node writes to the filesystem at
a time, then the backups run on the other node, but I imagine there
could still be locking problems between the two.

The /proc/cluster/lock_dlm/drop_count was 50000. After echo the 0, it
stayed at 0 for a while, and has not gone up yet. I'm not sure what it
should be, but it did not make a difference on its own. I guess I will
have to research these.

I will add a remount at the end of backups to see if that helps any. We
might be updating later this year, so hopefully this would address #3 at
that time.

Thanks

On Fri, 2007-07-20 at 11:16 -0500, David Teigland wrote:
> On Wed, Jul 11, 2007 at 11:15:54AM -0400, Danny Wall wrote:
> > My backups are taking a very long time, and I am looking for ways to
> > optimize the process, so suggestions are welcome.
> 
> I suspect that a large part of this is just the inherent slowness of gfs.
> Here are some ideas to try,
> 1. echo "0" > /proc/cluster/lock_dlm/drop_count
> 2. mount each fs with nolock during the backup (mount -o lockproto=lock_nolock)
> 3. unmount/remount each fs after the backup is done with it to get rid
>    of the excessive unwanted caching effects from scanning the whole fs
> 4. use the glock_purge feature that's in 4.5 (which approximates 3 without
>    the unmount)
> 
> Dave
> 
> 
> 






From teigland at redhat.com  Fri Jul 20 17:44:38 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 20 Jul 2007 12:44:38 -0500
Subject: [Linux-cluster] Backups with GFS and RAM sizing
In-Reply-To: <46A0B978020000C8000030A1@gwia.health-first.org>
References: <46A0B978020000C8000030A1@gwia.health-first.org>
Message-ID: <20070720174438.GG4329@redhat.com>

On Fri, Jul 20, 2007 at 01:32:40PM -0400, Danny Wall wrote:
> Thanks for the information. I am a little hesitant about mounting with
> the no_lock option during backups, since the filesystem will be mounted
> on multiple nodes. Currently, only one node writes to the filesystem at
> a time, then the backups run on the other node, but I imagine there
> could still be locking problems between the two.

I was suggesting just having the backup node mount the fs during the
backup, unmounting from all others (assuming they could do without it).

> The /proc/cluster/lock_dlm/drop_count was 50000. After echo the 0, it
> stayed at 0 for a while, and has not gone up yet. I'm not sure what it
> should be, but it did not make a difference on its own. I guess I will
> have to research these.

It's a config setting so it won't change.  A non-zero value limits the
lock caching gfs can do which limits performance.  In the release you're
using I believe you need to set it before mounting for it to have any
impact on the fs.

> I will add a remount at the end of backups to see if that helps any. We
> might be updating later this year, so hopefully this would address #3 at
> that time.

Just to be clear, mount -o remount isn't sufficient to clear out the
unwanted cache, an unmount is required followed by a mount.

Dave



From Danny.Wall at health-first.org  Fri Jul 20 18:19:38 2007
From: Danny.Wall at health-first.org (Danny Wall)
Date: Fri, 20 Jul 2007 14:19:38 -0400
Subject: [Linux-cluster] Backups with GFS and RAM sizing
Message-ID: <46A0C47A020000C8000030B0@gwia.health-first.org>

>I was suggesting just having the backup node mount the fs during the
> backup, unmounting from all others (assuming they could do without it).

Unfortunately, this is a cardiology imaging system, so it can not be
taken down for backups.

>Just to be clear, mount -o remount isn't sufficient to clear out the
> unwanted cache, an unmount is required followed by a mount.

Thanks for clarifying. That would have thrown me.

Danny

On Fri, 2007-07-20 at 12:44 -0500, David Teigland wrote:
> On Fri, Jul 20, 2007 at 01:32:40PM -0400, Danny Wall wrote:
> > Thanks for the information. I am a little hesitant about mounting with
> > the no_lock option during backups, since the filesystem will be mounted
> > on multiple nodes. Currently, only one node writes to the filesystem at
> > a time, then the backups run on the other node, but I imagine there
> > could still be locking problems between the two.
> 
> I was suggesting just having the backup node mount the fs during the
> backup, unmounting from all others (assuming they could do without it).
> 
> > The /proc/cluster/lock_dlm/drop_count was 50000. After echo the 0, it
> > stayed at 0 for a while, and has not gone up yet. I'm not sure what it
> > should be, but it did not make a difference on its own. I guess I will
> > have to research these.
> 
> It's a config setting so it won't change.  A non-zero value limits the
> lock caching gfs can do which limits performance.  In the release you're
> using I believe you need to set it before mounting for it to have any
> impact on the fs.
> 
> > I will add a remount at the end of backups to see if that helps any. We
> > might be updating later this year, so hopefully this would address #3 at
> > that time.
> 
> Just to be clear, mount -o remount isn't sufficient to clear out the
> unwanted cache, an unmount is required followed by a mount.
> 
> Dave
> 
> 
> 






From bfilipek at crscold.com  Fri Jul 20 20:55:18 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Fri, 20 Jul 2007 15:55:18 -0500
Subject: [Linux-cluster] fencing failing
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CEDCD@SRVEDI.upark.crscold.com>

I have an APC MasterSwitch as my fencing device. I configured my cluster
to use "APC" as the fencing device, and have confirmed that it has the
correct un, pw, and IP address configured. However, when it tries to
reboot a failed node, I get this in /var/log/messages:

 

Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
failed: unrecognised menu response

Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
failed

 

However, when I run this command from a terminal, it runs fine and the
failed node reboots:

 

fence_apc -a 192.168.1.61 -l ***** -p ***** -n 6 -v

 

 

 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070720/e41988e9/attachment.htm>

From jparsons at redhat.com  Fri Jul 20 21:10:55 2007
From: jparsons at redhat.com (jim parsons)
Date: Fri, 20 Jul 2007 17:10:55 -0400
Subject: [Linux-cluster] fencing failing
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D026CEDCD@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D026CEDCD@SRVEDI.upark.crscold.com>
Message-ID: <1184965855.3019.31.camel@localhost.localdomain>

On Fri, 2007-07-20 at 15:55 -0500, Brad Filipek wrote:
> I have an APC MasterSwitch as my fencing device. I configured my
> cluster to use ?APC? as the fencing device, and have confirmed that it
> has the correct un, pw, and IP address configured. However, when it
> tries to reboot a failed node, I get this in /var/log/messages:
> 
>  
> 
> Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
> failed: unrecognised menu response
> 
> Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
> failed
> 
>  
> 
> However, when I run this command from a terminal, it runs fine and the
> failed node reboots:
> 
>  
> 
> fence_apc -a 192.168.1.61 ?l ***** -p ***** -n 6 ?v

Ooohh...that is not good. Can you please tell me if this is rhel4 or rhel5?

Can you send your cluster.conf file? If the agent works from the command
line bu not within the cluster code, it could be an error in the conf
file. XXX out all passwords and such that you care about, or course,
before sending to list.

 Can you telnet into the apc switch and see what firmware version it is using?
There are two version values on the welcome screen that would be nice to
know:

  Network Management Card AOS      vx.x.x
  Rack PDU APP                     vx.x.x

-J




From bfilipek at crscold.com  Fri Jul 20 21:26:35 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Fri, 20 Jul 2007 16:26:35 -0500
Subject: [Linux-cluster] fencing failing
References: <9C01E18EF3BC2448A3B1A4812EB87D026CEDCD@SRVEDI.upark.crscold.com>
	<1184965855.3019.31.camel@localhost.localdomain>
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D026CEDEB@SRVEDI.upark.crscold.com>

Hi Jim,

This is in rhel5

APC Firmware:
=======================================================================
Network Management Card AOS      v2.6.4
MSP APP                          v2.6.2
=======================================================================

cluster.conf file:
=======================================================================
<?xml version="1.0"?>
<cluster alias="cluster1" config_version="20" name="cluster1">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="server1.my.domain.com" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="APCMS62" port="7"
switch="0"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="server2.my.domain.com" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="APCMS62" port="6"
switch="0"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.61"
login="name" name="APCMS61" passwd="pass"/>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.62"
login="name" name="APCMS62" passwd="pass"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="main" ordered="0"
restricted="0">
                                <failoverdomainnode
name="server1.my.domain.com" priority="1"/>
                                <failoverdomainnode
name="server2.my.domain.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <smb name="smb1" workgroup="WKGRP"/>
                        <ip address="192.168.1.20" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="main" name="samba">
                        <smb ref="smb"/>
                        <ip ref="192.168.1.20"/>
                </service>
        </rm>
</cluster>
=======================================================================



Brad Filipek


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of jim parsons
Sent: Friday, July 20, 2007 4:11 PM
To: linux clustering
Subject: Re: [Linux-cluster] fencing failing

On Fri, 2007-07-20 at 15:55 -0500, Brad Filipek wrote:
> I have an APC MasterSwitch as my fencing device. I configured my
> cluster to use "APC" as the fencing device, and have confirmed that it
> has the correct un, pw, and IP address configured. However, when it
> tries to reboot a failed node, I get this in /var/log/messages:
> 
>  
> 
> Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
> failed: unrecognised menu response
> 
> Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
> failed
> 
>  
> 
> However, when I run this command from a terminal, it runs fine and the
> failed node reboots:
> 
>  
> 
> fence_apc -a 192.168.1.61 -l ***** -p ***** -n 6 -v

Ooohh...that is not good. Can you please tell me if this is rhel4 or
rhel5?

Can you send your cluster.conf file? If the agent works from the command
line bu not within the cluster code, it could be an error in the conf
file. XXX out all passwords and such that you care about, or course,
before sending to list.

 Can you telnet into the apc switch and see what firmware version it is
using?
There are two version values on the welcome screen that would be nice to
know:

  Network Management Card AOS      vx.x.x
  Rack PDU APP                     vx.x.x

-J


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.




From chris at cmiware.com  Sat Jul 21 20:41:04 2007
From: chris at cmiware.com (Chris Harms)
Date: Sat, 21 Jul 2007 15:41:04 -0500
Subject: [Linux-cluster] [UPDATE] IP monitor failing periodically
Message-ID: <46A26F60.8020402@cmiware.com>

We reinstalled our machines with RHEL 5 x86_64 (we were running i386) a 
few weeks ago and the mysterious IP monitoring failures have disappeared. 
I believe it was postulated that a compiler bug regarding -fpie might be 
causing segfaults in i386 binaries, so this would support that theory to 
some degree, although I did not really attempt to confirm it further.  I 
thought the architecture change fixing the random failovers was noteworthy.

### previous thread below

Hi Chris,

I am experiencing the same problem on RHEL 5 and I have a support 
request in with RedHat.

I was asked to increase the debug level by changing the <rm> line in the 
cluster configuration to:

<rm log_facility="local4" log_level="7">

I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and 
run "service syslog restart".

To update the cluster configuration I needed to propagate the cluster 
configuration to both nodes:

# ccs_tool update /etc/cluster/cluster.conf

After a week I have not had the problem with the increased logging 
despite the problem occurring regularly prior to that - 2 to 3 times a 
day. One day last week out of curiosity I reverted to the default 
settings and within a few hours I had the failure to ping error on one 
of the clustered IP addresses and the service was restarted.

I now have the logging back at 7 and the support request is pending.

Regards
-- 
David Schroeder
Server Support
Information Services Division
Flinders University
Adelaide, Australia
Ph: +61 8 8201 2689


Chris Harms wrote:
> I am experiencing periodic failovers due to a floating IP address not 
> passing the status check:
> 
> clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
> Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip 
> "192.168.13.204" returned 1 (generic error)
> 
> Both nodes have bonded NICs with gigabit connections to redundant 
> switches, so it is unlikely they are going down, nothing in the logs 
> about linux losing the links.  I parked all the cluster services - 2 
> Postgres services and 1 Apache - on one node and allowed it to run 
> overnight.  There would be no client activity during this time. One 
> Postgres service failed two times in this manner and the other failed 
> once in this manner.  The Apache service did not fail.
> 
> What can I do to resolve this or get more information out of the system 
> to resolve this?
> 
> Thanks in advance,
> Chris
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From aravind.parchuri at gmail.com  Sun Jul 22 05:08:01 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Sat, 21 Jul 2007 22:08:01 -0700
Subject: [Linux-cluster] Tools always return EXIT_FAILURE
Message-ID: <46A2E631.6020108@gmail.com>

I hope I'm not missing something obvious, but could someone tell me why 
ccs_tool exits with "EXIT_FAILURE" even when it's completed the 
requested operation successfully?


    else if(!strcmp(argv[optind], "addnode")){
	    add_node(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "delnode")){
	    del_node(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "addfence")){
	    add_fence(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "delfence")){
	    del_fence(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "lsnode")){
	    list_nodes(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "lsfence")){
	    list_fences(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "create")){
	    create_skeleton(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }
     else if(!strcmp(argv[optind], "addnodeids")){
	    add_nodeids(argc-1, argv+1);
	    exit(EXIT_FAILURE);
     }

     else {
       fprintf(stderr, "Unknown command, %s.\n"
	      "Try 'ccs_tool help' for help.\n", argv[optind]);
       exit(EXIT_FAILURE);
     }

It's a bit of a bother that any automation working with these tools has 
to always parse the output - I cannot put in any sort of future-proof 
error-checking that just looks at the exit code.

Thanks,
Aravind.



From henker at evendi.de  Mon Jul 23 11:12:22 2007
From: henker at evendi.de (Steffan Henke)
Date: Mon, 23 Jul 2007 13:12:22 +0200
Subject: [Linux-cluster] GFS inodes
In-Reply-To: <20070720172753.GF4329@redhat.com>
References: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
	<20070720172753.GF4329@redhat.com>
Message-ID: <Pine.WNT.4.64.0707231131530.2728@henker.shcom.us>

On Fri, 20 Jul 2007, David Teigland wrote:

> This is strange, after being unlinked, gfs_inoded should deallocate them,
> but apparently it's not.  What release are you using?  Could you record

Hello,

I am using EL 4.5 with kernel 2.6.9-55.ELsmp.

> the inode numbers of the files before they're unlinked next time (ls -li)?

This is my current dir prior to the mv:

total 2222900
  590019 -rw-r--r--  1 2271684518 Jul 23 06:57 _2cqqt.cfs
4587612 -rw-r--r--  1          4 Jul 23 06:57 deletable
4587613 -rw-r--r--  1         31 Jul 23 06:57 segments

> Then, after they're unlinked, do 'gfs_tool counters' (looking for unlinked
> count), and 'gfs_tool lockdump' and send this info?  I'd like to see what

gfs_tool counters /gfs

                                   locks 426
                              locks held 32
                            freeze count 0
                           incore inodes 10
                        metadata buffers 221
                         unlinked inodes 129
                               quota IDs 2
                      incore log buffers 0
                          log space used 0.05%
               meta header cache entries 0
                      glock dependencies 0
                  glocks on reclaim list 0
                               log wraps 1
                    outstanding LM calls 0
                   outstanding BIO calls 0
                        fh2dentry misses 0
                        glocks reclaimed 1914903
                          glock nq calls 3738433
                          glock dq calls 3641879
                    glock prefetch calls 1120342
                           lm_lock calls 1236087
                         lm_unlock calls 1139499
                            lm callbacks 2375643
                      address operations 1700121
                       dentry operations 44614
                       export operations 0
                         file operations 22763
                        inode operations 2365329
                        super operations 2265685
                           vm operations 0
                         block I/O reads 582429
                        block I/O writes 576463

gfs_tool lockdump returns an error message on all cluster nodes:
gfs_tool: unknown mountpoint /gfs

> state the iopen locks are in for the unlinked files.  Also send the output
> of 'gfs_tool gettune'.

gfs_tool gettune /gfs
ilimit1 = 10
ilimit1_tries = 3
ilimit1_min = 10
ilimit2 = 10
ilimit2_tries = 10
ilimit2_min = 3
demote_secs = 300
incore_log_blocks = 1024
jindex_refresh_secs = 60
depend_secs = 60
scand_secs = 5
recoverd_secs = 60
logd_secs = 1
quotad_secs = 5
inoded_secs = 15
glock_purge = 0
quota_simul_sync = 64
quota_warn_period = 10
atime_quantum = 3600
quota_quantum = 60
quota_scale = 1.0000   (1, 1)
quota_enforce = 1
quota_account = 1
new_files_jdata = 0
new_files_directio = 0
max_atomic_write = 4194304
max_readahead = 262144
lockdump_size = 131072
stall_secs = 600
complain_secs = 10
reclaim_limit = 50
entries_per_readdir = 32
prefetch_secs = 10
statfs_slots = 64
max_mhc = 10000
greedy_default = 100
greedy_quantum = 25
greedy_max = 250
rgrp_try_threshold = 100
statfs_fast = 0

A gfs_tool reclaim returns *some* metadata, but no inode at all, while a
service gfs restart shows

Jul 23 09:54:59 host kernel: GFS: fsid=gfs:lucene.0: Scanning for 
log elements...
Jul 23 09:54:59 host kernel: GFS: fsid=gfs:lucene.0: Found 130 
unlinked inodes

- however, they are not de-allocated unless I restart GFS on some/all 
cluster nodes...

Regards,

Steffan



From doc at mts.com.ua  Mon Jul 23 13:37:42 2007
From: doc at mts.com.ua (Eugene Melnichuk)
Date: Mon, 23 Jul 2007 16:37:42 +0300
Subject: [Linux-cluster] Hang on start fence_tool join with qdisk
Message-ID: <46A4AF26.20204@mts.com.ua>



Hi,

I have problem with my cluster running on RHEL5 + updates from  
http://people.redhat.com/lhh/rhel5-test/  

I have 2 node cluster with shared quorum disk, qdiskd is running, but 
when I start service cman I hang on Starting fencing.
In my logs I have messages about regained qourum :

Jul 21 15:50:18 arf-web1 qdiskd[7326]: <info> Assuming master role
Jul 21 15:50:19 arf-web1 ccsd[8188]: Cluster is not quorate.  Refusing 
connection.
Jul 21 15:50:19 arf-web1 ccsd[8188]: Error while processing connect: 
Connection refused
Jul 21 15:50:19 arf-web1 openais[8200]: [CMAN ] quorum regained, 
resuming activity
Jul 21 15:50:20 arf-web1 clurgmgrd[7746]: <notice> Quorum formed, starting
Jul 21 15:50:20 arf-web1 kernel: dlm: no local IP address has been set
Jul 21 15:50:20 arf-web1 kernel: dlm: cannot start dlm lowcomms -12


After few minutes process of starting fencing finished , but I still do 
not have running services and in group_tool I see that joining to fence 
domain is not complete.

[root at arf-web1 ~]# group_tool
type             level name     id       state
fence            0     default  00010002 JOIN_START_WAIT
[2]

When I try issue commands like cman_tool or clustat I got nothing and 
hang on access to socket /var/run/cman_client (but can Ctrl-C running 
command)
[root at arf-web1 ~]# strace cman_tool status
execve("/usr/sbin/cman_tool", ["cman_tool", "status"], [/* 21 vars */]) = 0
<skip>
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110 
<unfinished ...>

[root at arf-web1 ~]# strace clustat
execve("/usr/sbin/clustat", ["clustat"], [/* 21 vars */]) = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110 
<unfinished ...>

What can I do to resolve this ?


Thanks in advance,
Eugene



--
Eugene Melnichuk
Lead Engineer
email: doc at umc.ua <mailto:doc at umc.ua>
mob: +380503304043
pbx: +380501105731
CJSC Ukrainian Mobile Communications
49/2 Pobedy ave., room 4.26, 03680, Kyiv, Ukraine


-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: text/xml
Size: 2342 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070723/9ae21293/attachment.xml>

From bfilipek at crscold.com  Mon Jul 23 18:33:28 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Mon, 23 Jul 2007 13:33:28 -0500
Subject: [Linux-cluster] fencing failing
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D02701A5C@SRVEDI.upark.crscold.com>


I have an APC MasterSwitch as my fencing device. I configured my cluster
to use "APC" as the fencing device, and have confirmed that it has the
correct un, pw, and IP address configured. However, when it tries to
reboot a failed node, I get this in /var/log/messages:

Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
failed: unrecognised menu response

Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
failed

However, when I run this command from a terminal, it runs fine and the
failed node reboots:

fence_apc -a 192.168.1.61 -l ***** -p ***** -n 6 -v


This is in rhel5

APC Firmware:
=======================================================================
Network Management Card AOS      v2.6.4
MSP APP                          v2.6.2
=======================================================================

cluster.conf file:
=======================================================================
<?xml version="1.0"?>
<cluster alias="cluster1" config_version="20" name="cluster1">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="server1.my.domain.com" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="APCMS62" port="7"
switch="0"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="server2.my.domain.com" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="APCMS62" port="6"
switch="0"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.61"
login="name" name="APCMS61" passwd="pass"/>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.62"
login="name" name="APCMS62" passwd="pass"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="main" ordered="0"
restricted="0">
                                <failoverdomainnode
name="server1.my.domain.com" priority="1"/>
                                <failoverdomainnode
name="server2.my.domain.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <smb name="smb1" workgroup="WKGRP"/>
                        <ip address="192.168.1.20" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="main" name="samba">
                        <smb ref="smb"/>
                        <ip ref="192.168.1.20"/>
                </service>
        </rm>
</cluster>
=======================================================================


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.




From teigland at redhat.com  Mon Jul 23 19:08:28 2007
From: teigland at redhat.com (David Teigland)
Date: Mon, 23 Jul 2007 14:08:28 -0500
Subject: [Linux-cluster] GFS inodes
In-Reply-To: <Pine.WNT.4.64.0707231131530.2728@henker.shcom.us>
References: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
	<20070720172753.GF4329@redhat.com>
	<Pine.WNT.4.64.0707231131530.2728@henker.shcom.us>
Message-ID: <20070723190828.GD5760@redhat.com>

On Mon, Jul 23, 2007 at 01:12:22PM +0200, Steffan Henke wrote:
> gfs_tool lockdump returns an error message on all cluster nodes:
> gfs_tool: unknown mountpoint /gfs

Maybe gfs is on an odd device that gfs_tool can't match up with the
mountpoint.  Try this instead:

# gfs_tool list
18446675904036380672 sdb foo:x.0

# gfs_tool lockdump 18446675904036380672 > lockdump.txt

> >state the iopen locks are in for the unlinked files.  Also send the output
> >of 'gfs_tool gettune'.
> 
> gfs_tool gettune /gfs
> ilimit1 = 10
> ilimit1_tries = 3
> ilimit1_min = 10
> ilimit2 = 10
> ilimit2_tries = 10
> ilimit2_min = 3

I can't really say what effect these non-default values would have.  Could
you try the default ilimits?

> A gfs_tool reclaim returns *some* metadata, but no inode at all, while a
> service gfs restart shows

reclaim converts deallocated metadata blocks into normal free blocks.
Your problem is that inodes are never being deallocated in the first place
after being unlinked.

Dave



From aravind.parchuri at gmail.com  Mon Jul 23 19:23:11 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Mon, 23 Jul 2007 12:23:11 -0700
Subject: [Linux-cluster] [UPDATE] IP monitor failing periodically
In-Reply-To: <46A26F60.8020402@cmiware.com>
References: <46A26F60.8020402@cmiware.com>
Message-ID: <46A5001F.80600@gmail.com>

I'm not sure about the segfaults, but we are facing the same issues on 
RHEL5 and FC6, i368 - random failovers due to ip-check failures. This 
workaround seems to help, for now at least:

http://www.redhat.com/archives/linux-cluster/2006-March/msg00329.html

I'll check if it is indeed the ping segfaulting and report back when I 
get some time.

Aravind.

chris at cmiware.com wrote:
> We reinstalled our machines with RHEL 5 x86_64 (we were running i386) a 
> few weeks ago and the mysterious IP monitoring failures have 
> disappeared. I believe it was postulated that a compiler bug regarding 
> -fpie might be causing segfaults in i386 binaries, so this would support 
> that theory to some degree, although I did not really attempt to confirm 
> it further.  I thought the architecture change fixing the random 
> failovers was noteworthy.
> 
> ### previous thread below
> 
> Hi Chris,
> 
> I am experiencing the same problem on RHEL 5 and I have a support 
> request in with RedHat.
> 
> I was asked to increase the debug level by changing the <rm> line in the 
> cluster configuration to:
> 
> <rm log_facility="local4" log_level="7">
> 
> I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and 
> run "service syslog restart".
> 
> To update the cluster configuration I needed to propagate the cluster 
> configuration to both nodes:
> 
> # ccs_tool update /etc/cluster/cluster.conf
> 
> After a week I have not had the problem with the increased logging 
> despite the problem occurring regularly prior to that - 2 to 3 times a 
> day. One day last week out of curiosity I reverted to the default 
> settings and within a few hours I had the failure to ping error on one 
> of the clustered IP addresses and the service was restarted.
> 
> I now have the logging back at 7 and the support request is pending.
> 
> Regards



From aravind.parchuri at gmail.com  Mon Jul 23 19:31:44 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Mon, 23 Jul 2007 12:31:44 -0700
Subject: [Linux-cluster] fencing failing
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D02701A5C@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D02701A5C@SRVEDI.upark.crscold.com>
Message-ID: <46A50220.2050902@gmail.com>

bfilipek at crscold.com wrote:
> I have an APC MasterSwitch as my fencing device. I configured my cluster
> to use "APC" as the fencing device, and have confirmed that it has the
> correct un, pw, and IP address configured. However, when it tries to
> reboot a failed node, I get this in /var/log/messages:
> 
> Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
> failed: unrecognised menu response
> 
We faced the same problem in FC6, with an APC 7900 switch.
> Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
> failed
> 
> However, when I run this command from a terminal, it runs fine and the
> failed node reboots:
> 
> fence_apc -a 192.168.1.61 -l ***** -p ***** -n 6 -v

In our case, even running it from the command line didn't work. The rpms 
in the repo have the old perl script - probably the case with the RHEL5 
rpms too. The python script in CVS seems to work fine though:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/apc/fence_apc.py?rev=1.5&content-type=text/x-cvsweb-markup&cvsroot=cluster

Try replacing /sbin/fence_apc with the python script and see if it helps.

> 
> 
> This is in rhel5
> 
> APC Firmware:
> =======================================================================
> Network Management Card AOS      v2.6.4
> MSP APP                          v2.6.2
> =======================================================================
> 
> cluster.conf file:
> =======================================================================
> <?xml version="1.0"?>
> <cluster alias="cluster1" config_version="20" name="cluster1">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="server1.my.domain.com" nodeid="1"
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="APCMS62" port="7"
> switch="0"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="server2.my.domain.com" nodeid="2"
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="APCMS62" port="6"
> switch="0"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_apc" ipaddr="192.168.1.61"
> login="name" name="APCMS61" passwd="pass"/>
>                 <fencedevice agent="fence_apc" ipaddr="192.168.1.62"
> login="name" name="APCMS62" passwd="pass"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="main" ordered="0"
> restricted="0">
>                                 <failoverdomainnode
> name="server1.my.domain.com" priority="1"/>
>                                 <failoverdomainnode
> name="server2.my.domain.com" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <smb name="smb1" workgroup="WKGRP"/>
>                         <ip address="192.168.1.20" monitor_link="1"/>
>                 </resources>
>                 <service autostart="1" domain="main" name="samba">
>                         <smb ref="smb"/>
>                         <ip ref="192.168.1.20"/>
>                 </service>
>         </rm>
> </cluster>
> =======================================================================
> 
> 
> Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 
> 
> If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Mon Jul 23 20:20:23 2007
From: jparsons at redhat.com (jim parsons)
Date: Mon, 23 Jul 2007 16:20:23 -0400
Subject: [Linux-cluster] fencing failing
In-Reply-To: <46A50220.2050902@gmail.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D02701A5C@SRVEDI.upark.crscold.com>
	<46A50220.2050902@gmail.com>
Message-ID: <1185222023.3045.8.camel@localhost.localdomain>

On Mon, 2007-07-23 at 12:31 -0700, Aravind Parchuri wrote:
> bfilipek at crscold.com wrote:
> > I have an APC MasterSwitch as my fencing device. I configured my cluster
> > to use "APC" as the fencing device, and have confirmed that it has the
> > correct un, pw, and IP address configured. However, when it tries to
> > reboot a failed node, I get this in /var/log/messages:
> > 
> > Jul 20 15:51:28 server1 fenced[32169]: agent "fence_apc" reports:
> > failed: unrecognised menu response
> > 
> We faced the same problem in FC6, with an APC 7900 switch.
> > Jul 20 15:51:28 server1 fenced[32169]: fence "server2.my.domain.com"
> > failed
> > 
> > However, when I run this command from a terminal, it runs fine and the
> > failed node reboots:
> > 
> > fence_apc -a 192.168.1.61 -l ***** -p ***** -n 6 -v
> 
> In our case, even running it from the command line didn't work. The rpms 
> in the repo have the old perl script - probably the case with the RHEL5 
> rpms too. The python script in CVS seems to work fine though:
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/apc/fence_apc.py?rev=1.5&content-type=text/x-cvsweb-markup&cvsroot=cluster
> 
> Try replacing /sbin/fence_apc with the python script and see if it helps.
Indeed, the apc version of the agent works much better and has a more
sane way of matching screens. The perl agent was getting downright
undecipherable - even to grizzly old perl veterans :)

There is a new version checked into RHEL5 Head.

I am attaching it here so you have it quick - it handles firmware v3

-J
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_apc
Type: text/x-python
Size: 25560 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070723/7774b1ec/attachment.py>

From aravind.parchuri at gmail.com  Mon Jul 23 20:26:40 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Mon, 23 Jul 2007 13:26:40 -0700
Subject: [Linux-cluster] ccs_tool addnode with multiple fence devices
Message-ID: <46A50F00.3090004@gmail.com>

We have a couple of nodes with redundant power supplies and two APC 7900 
power switches that we need to connect them to. I can do ccs_tool 
addfence to add both switches to the cluster.conf file, but is there any 
way to specify both switches for each node, from the command line? 
Thanks in advance.

Aravind.



From aravind.parchuri at gmail.com  Tue Jul 24 02:10:02 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Mon, 23 Jul 2007 19:10:02 -0700
Subject: [Linux-cluster] Command-line tools to add resources,
	services to cluster.conf
Message-ID: <46A55F7A.8020604@gmail.com>

It looks like ccs_tool only does nodes and fence devices. Is there some 
other tool to add/delete resources, failover domains and services to 
cluster.conf from the command line?

I guess it's simple enough to generate the xml, but it would be better 
if there were some standard way of doing it. Thanks in advance.

Aravind.



From henker at evendi.de  Tue Jul 24 07:12:16 2007
From: henker at evendi.de (Steffan Henke)
Date: Tue, 24 Jul 2007 09:12:16 +0200
Subject: [Linux-cluster] GFS inodes
In-Reply-To: <20070723190828.GD5760@redhat.com>
References: <Pine.WNT.4.64.0707181439460.3356@henker.shcom.us>
	<20070720172753.GF4329@redhat.com>
	<Pine.WNT.4.64.0707231131530.2728@henker.shcom.us>
	<20070723190828.GD5760@redhat.com>
Message-ID: <Pine.WNT.4.64.0707240907290.3584@henker.shcom.us>

On Mon, 23 Jul 2007, David Teigland wrote:

> # gfs_tool list
> 18446675904036380672 sdb foo:x.0
>
> # gfs_tool lockdump 18446675904036380672 > lockdump.txt

I send the lockdump.txt off-list as it's rather large...

> I can't really say what effect these non-default values would have.  Could
> you try the default ilimits?

OK, I have the default values back again. I was hoping that, by tweaking 
the ilimit* and reclaim_limit settings, I could speed up the de-allocation 
of the deleted inodes.

Regards,

Steffan



From sebastia at l00-bugdead-prods.de  Tue Jul 24 07:21:22 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 24 Jul 2007 09:21:22 +0200
Subject: [Linux-cluster] gfs2 and linux-ha
Message-ID: <20070724072122.DAA3D48847@l00-bugdead-prods.de>

Hi,

I got the gfs2 running in a two node cluster, thanks for the help from the 
list.

Do I can manage the gfs2 volumes with linux-ha? Do I can make the gfs2 aware 
that heartbeat is used and fencing is done by linux-ha instead of the rh 
cluster suite? From ocfs2 I know that I can tell the filesystem to use the 
linux-ha heartbeat instead of its own, but I don't know how to do this for 
gfs2, if it is possible at all.

kind regards
Sebastian



From sebastia at l00-bugdead-prods.de  Tue Jul 24 07:37:03 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 24 Jul 2007 09:37:03 +0200
Subject: [Linux-cluster] dlm_sendd and dlm_sendd running on 100% CPU
Message-ID: <20070724073704.4066C48865@l00-bugdead-prods.de>

Hi,

after getting the cluster up and running, and mounting the gfs2 partitions. 
With one mounted gfs2 partition, the dlm_sendd and dlm_recvd both consume 
100% CPU. When I umount the partition, the CPU usage drops to zero.

While at 100%, there are no message popping up in /var/log/messages.
Any idea what could be the reason for the CPU usage?

I use kernel 2.6.20.15 on x86_64 openSUSE 10.2, openais-0.80.1, and 
cluster-2.00.00.

kind regards
Sebastian



From jacquesb at fnb.co.za  Tue Jul 24 13:19:11 2007
From: jacquesb at fnb.co.za (Jacques Botha)
Date: Tue, 24 Jul 2007 15:19:11 +0200
Subject: [Linux-cluster] Timed out waiting for a response from Resource
	Group Manager
Message-ID: <1185283151.10165.6.camel@f2821966>

Hi All

I've got a quorate 4 node cluster, with a quorum disk.

_If_ I have _No_ fail over domain, resources or services defined,
executing clustat is _fast_, and gives the following:

[root at fnbgw02 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  fnbgw01.fnb.co.za                     1 Online
  fnbgw02.fnb.co.za                     2 Online, Local
  fnbgw03.fnb.co.za                     3 Online
  fnbgw04.fnb.co.za                     4 Online
  /dev/sdb1                             0 Online, Quorum Disk


_If_ I define a failover group, a resource and a service, executing
clustat is _slow_ and it fails to show the defined service:

[root at fnbgw02 ~]# clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  fnbgw01.fnb.co.za                     1 Online, Local
  fnbgw02.fnb.co.za                     2 Online
  fnbgw03.fnb.co.za                     3 Online
  fnbgw04.fnb.co.za                     4 Online
  /dev/sdb1                             0 Online, Quorum Disk

This is on CentOS5, with all the latest updates from CentOS applied.

Please help.
-- 
Jacques Botha
South Africa
+27-11-889-4142

To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: 
https://www.fnb.co.za/disclaimer.html 

If you are unable to access the Disclaimer, send a blank e-mail to
firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070724/2389a486/attachment.sig>

From Dan.Askew at jmsmucker.com  Tue Jul 24 13:34:39 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Tue, 24 Jul 2007 09:34:39 -0400
Subject: [Linux-cluster] LVM.Conf changes
In-Reply-To: <1185283151.10165.6.camel@f2821966>
Message-ID: <OF6A65FEE4.8BF18893-ON85257322.004A585A-85257322.004B0C80@jmsmucker.com>

Greetings all,

I have a Linux Cluster that has the following setup

One local LVM (vg01) and several clustered file systems (vg01, vg02, 
vg03....)


What changes can be made in "lvm.conf" so that I can include vg01 in the 
/etc/fstab get it activated and mounted at boot....
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070724/4659d91a/attachment.htm>

From hlawatschek at atix.de  Tue Jul 24 14:21:45 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 24 Jul 2007 16:21:45 +0200
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
Message-ID: <200707241621.45892.hlawatschek@atix.de>

Hi,

I'd like to use the "depend" feature of rgmanager on RHEL4. Is there a way to 
do this. E.g. would it be possible to run the cvs HEAD version of rgmanager 
on a RHEL4.5 cluster infrastructure ? 

Mark



From Markus at hochholdinger.net  Tue Jul 24 15:34:16 2007
From: Markus at hochholdinger.net (Markus Hochholdinger)
Date: Tue, 24 Jul 2007 17:34:16 +0200
Subject: [Linux-cluster] gnbd and caching
Message-ID: <200707241734.21198.Markus@hochholdinger.net>

Hi,

i'm running some standalone gnbd servers (option -c) and i'm wondering where 
the (read-)caching of the gnbd happens and when/how the cache will be 
emptied.

As far as i understand it, the caching happens on the gnbd client. Is this 
correct?

If so, when will the cache of the gnbd client be emptied? Will it be emptied 
if no one accesses the device? Or will it be cached until the device is 
reimported (removed and imported)?

Are there other possibilities to clear the read cache? Are there possibilities 
to see how much data is cached (allocated memory of gnbd_recvd)? Or have i to 
look for linux page cache?


I'm using gnbd as network block devices for virtual hosts on other servers. 
These virtual hosts can be moved between several hardware, but i want to be 
sure no old (cached) data will be read by the virtual servers.


-- 
greetings

eMHa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070724/4cf37168/attachment.sig>

From teigland at redhat.com  Tue Jul 24 16:37:24 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 24 Jul 2007 11:37:24 -0500
Subject: [Linux-cluster] dlm_sendd and dlm_sendd running on 100% CPU
In-Reply-To: <20070724073704.4066C48865@l00-bugdead-prods.de>
References: <20070724073704.4066C48865@l00-bugdead-prods.de>
Message-ID: <20070724163724.GC6518@redhat.com>

On Tue, Jul 24, 2007 at 09:37:03AM +0200, Sebastian Reitenbach wrote:
> Hi,
> 
> after getting the cluster up and running, and mounting the gfs2 partitions. 
> With one mounted gfs2 partition, the dlm_sendd and dlm_recvd both consume 
> 100% CPU. When I umount the partition, the CPU usage drops to zero.
> 
> While at 100%, there are no message popping up in /var/log/messages.
> Any idea what could be the reason for the CPU usage?
> 
> I use kernel 2.6.20.15 on x86_64 openSUSE 10.2, openais-0.80.1, and 
> cluster-2.00.00.

This is an old dlm bug that was fixed a long time ago.  I'll be releasing
a new cluster tarball shortly for 2.6.23-rc.

Dave



From james at cloud9.co.uk  Tue Jul 24 16:42:08 2007
From: james at cloud9.co.uk (James Fidell)
Date: Tue, 24 Jul 2007 17:42:08 +0100
Subject: [Linux-cluster] newbie mirrored LVM/GFS query
Message-ID: <46A62BE0.4000006@cloud9.co.uk>

I'm running a couple of CentOS4.5 boxes with all current updates
configured as a cluster.  Both have access to the same three
iscsi-connected disks on which I've created a mirrored LVM volume.
I've created a GFS filesystem on the LVM volume and can mount it
and read/write from both machines in the cluster.  So far, so good.

Now I want to test what happens when different parts of the system
fail.  First I thought I'd try disconnecting one of the iscsi disks
that form the data component of the LVM mirror, leaving the other
data disk and log active.  I expected LVM to fail the disk and run
in degraded linear mode until I restored the disconnected disk.

What actually happens is that all access to the LVM volume hangs.
The iscsi layer on the cluster servers logs that the connection has
dropped, but that's it.

Am I missing something important here, or is this just expecting too
much?

James



From jparsons at redhat.com  Tue Jul 24 16:45:55 2007
From: jparsons at redhat.com (James Parsons)
Date: Tue, 24 Jul 2007 12:45:55 -0400
Subject: [Linux-cluster] ccs_tool addnode with multiple fence devices
In-Reply-To: <46A50F00.3090004@gmail.com>
References: <46A50F00.3090004@gmail.com>
Message-ID: <46A62CC3.8080601@redhat.com>

Aravind Parchuri wrote:

> We have a couple of nodes with redundant power supplies and two APC 
> 7900 power switches that we need to connect them to. I can do ccs_tool 
> addfence to add both switches to the cluster.conf file, but is there 
> any way to specify both switches for each node, from the command line? 
> Thanks in advance.
>
> Aravind.
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Not that I know, but make certain that you use the 'option' attribute to 
make certain both power supplies are off, before they are turned on 
again. The default behavior is reboot, and it is possible for both 
supplies to never be off at the same time - hence the machine is never 
rebooted

In the cluster.conf file, it would look like this:
<clusternode name="node1">
  <fence>
   <method name="1">
    <device name="my_apc" port="1" option="Off"/>
    <device name="my_apc" port="2" option="Off"/>
    <device name="my_apc" port="1" option="On"/>
    <device name="my_apc" port="2" option="On"/>
  </method>
 </fence>
</clusternode>



From teigland at redhat.com  Tue Jul 24 17:57:45 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 24 Jul 2007 12:57:45 -0500
Subject: [Linux-cluster] cluster-2.01.00
Message-ID: <20070724175744.GD6518@redhat.com>

A new source tarball of cluster code from cvs head has been released.  The
cluster tarballs are usually released for the latest released kernel
version, but in this case it's for 2.6.23-rc kernels.  This is because of
the change to the dlm user/kernel interface which we've already begun
using in libdlm (see extra step below to install new dlm headers.)

  ftp://sources.redhat.com/pub/cluster/releases/cluster-2.01.00.tar.gz

. start with 2.6.23(-rc) kernel

. change gfs2 to allow gfs1 to share its lock modules by adding these
  three lines to the end of linux/fs/gfs2/locking.c

  EXPORT_SYMBOL_GPL(gfs2_mount_lockproto);
  EXPORT_SYMBOL_GPL(gfs2_unmount_lockproto);
  EXPORT_SYMBOL_GPL(gfs2_withdraw_lockproto);

. In addition to building and installing the kernel and modules, you need
  to install the new dlm headers, e.g.
  cd /usr/src/linux;
  make headers_install
  cp /usr/src/linux/usr/include/linux/dlm* /usr/include/linux/

. openais 0.80.3 is required (openais.org)
  cd openais-0.80.3
  make; make install DESTDIR=/

. compile and install cluster-2.01.00
  cd cluster-2.01.00
  ./configure --kernel_src=/usr/src/linux (--libdir=/usr/lib64 for x86-64)
  make install



From sebastia at l00-bugdead-prods.de  Tue Jul 24 18:16:18 2007
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Tue, 24 Jul 2007 20:16:18 +0200
Subject: [Linux-cluster] dlm_sendd and dlm_sendd running on 100% CPU
Message-ID: <20070724181618.B06E6488FD@l00-bugdead-prods.de>

Hi,

David Teigland <teigland at redhat.com> wrote: 
> On Tue, Jul 24, 2007 at 09:37:03AM +0200, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > after getting the cluster up and running, and mounting the gfs2 
partitions. 
> > With one mounted gfs2 partition, the dlm_sendd and dlm_recvd both 
consume 
> > 100% CPU. When I umount the partition, the CPU usage drops to zero.
> > 
> > While at 100%, there are no message popping up in /var/log/messages.
> > Any idea what could be the reason for the CPU usage?
> > 
> > I use kernel 2.6.20.15 on x86_64 openSUSE 10.2, openais-0.80.1, and 
> > cluster-2.00.00.
> 
> This is an old dlm bug that was fixed a long time ago.  I'll be releasing
> a new cluster tarball shortly for 2.6.23-rc.
> 
thanks for the information. 

kind regards
Sebastian



From simanhew at gmail.com  Tue Jul 24 19:13:20 2007
From: simanhew at gmail.com (siman hew)
Date: Tue, 24 Jul 2007 15:13:20 -0400
Subject: [Linux-cluster] How does log option work in RHEL5 cluster
Message-ID: <6596a7c70707241213p7911d0f2j6c574db04bd60e70@mail.gmail.com>

I got confused with log option setting in cluster.conf.

I specified log_level 3 in tag rm, verify with ccs_test, but does not look
like it works as expected.

here are commands I run and the results:

 [root at node02 log]# ccs_test connect
Connect successful.
 Connection descriptor = 29700

[root at node02 log]# ccs_test get 29700  //rm/@log_level
Get successful.
 Value = <3>

[root at node02 log]# clulog -s 5 -p progName "severity 5 should not be shown"

[root at node02 log]# tail -1 /var/log/messages
Jul 24 14:52:08 node02 clulog[22819]: <notice> severity 5 should not be shown

[root at node02 log]# clulog -s 5 -l 5 -p progName "severity 5 should show"

[root at node02 log]# tail -1 /var/log/messages
Jul 24 14:54:04 node02 clulog[23000]: <notice> severity 5 should show

[root at node02 log]# clulog -s 5 -l 4 -p progName "severity 5 should not
show, filter is 4"

[root at node02 log]# tail -1 /var/log/messages
Jul 24 14:54:04 node02 clulog[23000]: <notice> severity 5 should show


My question is why severity 5 is shown in log file even if I did not specify
filter in my command, clulog should uses the service manager's assigned log
level that is 3.
The cluster has 2 nodes, and all nodes are rebooted and restarted before run
the commands.

Any explanation ?

Thanks,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070724/abc14de6/attachment.htm>

From chris at cmiware.com  Tue Jul 24 21:16:36 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 24 Jul 2007 16:16:36 -0500
Subject: [Linux-cluster] cluster software interoperability
Message-ID: <46A66C34.5040802@cmiware.com>

My last hope for salvaging RHCS as a viable solution is to add a 3rd 
node to our previous 2-node setup.  Our 2 primary machines are on RHN 
and use the cluster packages from RedHat.  Will I be able to use the 
source files or fedora packages on a 3rd node running Fedora 7, or are 
there compatibility issues between versions?

Cheers,
Chris



From aravind.parchuri at gmail.com  Tue Jul 24 22:53:24 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Tue, 24 Jul 2007 15:53:24 -0700
Subject: [Linux-cluster] ccs_tool addnode with multiple fence devices
In-Reply-To: <46A62CC3.8080601@redhat.com>
References: <46A50F00.3090004@gmail.com> <46A62CC3.8080601@redhat.com>
Message-ID: <46A682E4.6080800@gmail.com>

Thanks for the heads-up - I figured that out while playing around with 
the setup.

Aravind.

jparsons at redhat.com wrote:
> Aravind Parchuri wrote:
> 
>> We have a couple of nodes with redundant power supplies and two APC 
>> 7900 power switches that we need to connect them to. I can do ccs_tool 
>> addfence to add both switches to the cluster.conf file, but is there 
>> any way to specify both switches for each node, from the command line? 
>> Thanks in advance.
>>
>> Aravind.
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> Not that I know, but make certain that you use the 'option' attribute to 
> make certain both power supplies are off, before they are turned on 
> again. The default behavior is reboot, and it is possible for both 
> supplies to never be off at the same time - hence the machine is never 
> rebooted
> 
> In the cluster.conf file, it would look like this:
> <clusternode name="node1">
>  <fence>
>   <method name="1">
>    <device name="my_apc" port="1" option="Off"/>
>    <device name="my_apc" port="2" option="Off"/>
>    <device name="my_apc" port="1" option="On"/>
>    <device name="my_apc" port="2" option="On"/>
>  </method>
> </fence>
> </clusternode>
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From teigland at redhat.com  Wed Jul 25 13:58:01 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 25 Jul 2007 08:58:01 -0500
Subject: [Linux-cluster] cluster software interoperability
In-Reply-To: <46A66C34.5040802@cmiware.com>
References: <46A66C34.5040802@cmiware.com>
Message-ID: <20070725135801.GA8834@redhat.com>

On Tue, Jul 24, 2007 at 04:16:36PM -0500, Chris Harms wrote:
> My last hope for salvaging RHCS as a viable solution is to add a 3rd 
> node to our previous 2-node setup.  Our 2 primary machines are on RHN 
> and use the cluster packages from RedHat.  Will I be able to use the 
> source files or fedora packages on a 3rd node running Fedora 7, or are 
> there compatibility issues between versions?

Your best bet may be to build a rhel5 kernel for the third node from the
src rpm, and then use the cluster source from the RHEL5 branch in cvs.

Dave



From lhh at redhat.com  Wed Jul 25 14:10:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:10:51 -0400
Subject: [Linux-cluster] RGMANAGER segmentation fault
In-Reply-To: <46155.48238.qm@web52309.mail.re2.yahoo.com>
References: <46155.48238.qm@web52309.mail.re2.yahoo.com>
Message-ID: <20070725141051.GI28521@redhat.com>

On Tue, Jul 17, 2007 at 06:44:03PM -0700, chirantha pitigala wrote:
> Hi all,
> 
> We have received segmentation fault at rgmanager several times while restarting a service. (Not frequent). After that automatic restart of the server happened.
> My OS is RHEL 4 Update 4 (2.6.9-42.ELsmp)
> rgmanager-1.9.54-1
> All the cluster packages are RHEL Update4.
> I saw this bug has been fixed in RHEL Update3. 
> 
> Log is as followed.
> 
> Jul 16 11:09:50 UI1 clurgmgrd[3631]: <notice> Stopping service reportgenerator Jul 16 11:09:51 UI1 clurgmgrd: [3631]: <info> Executing /etc/init.d/reportgenerator stop Jul 16 11:09:52 UI1 snmptrapd[3478]: 2007-07-16 11:09:52 192.168.40.1
> [192.168.40.1]: SNMPv2-MIB::sysUpTime.0 = Timeticks: (283880832) 32 days,
> 20:33:28.32     SNMPv2-MIB::snmpTrapOID.0 = OID: UCD-SNMP-MIB::linux   
> SNMPv2-SMI::private.2021.2.1.101 = STRING: "ReportGenerator stopped on UI1"
> Jul 16 11:09:53 UI1 clurgmgrd[3631]: <notice> Service reportgenerator is stopped Jul 16 11:09:53 UI1 kernel: clurgmgrd[28708]: segfault at 0000000000000050 rip 000000000040399e rsp 0000000043245010 error 4 Jul 16 11:10:01 UI1 crond(pam_unix)[32385]: session opened for user root by (uid=0) Jul 16 11:10:01 UI1 crond(pam_unix)[32387]: session opened for user ccs by

There's a couple of fixed crash-bugs in 4.5; you need to update
rgmanager, ccs, magma-plugins, and magma.  See the various errata for
4.5 to see what's specifically fixed.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:13:02 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:13:02 -0400
Subject: SV: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
References: <00B9BFA1C44A674794C9A1A4F5A22CA57B25F1@exchsrv07.rootdom.dk>
	<OF80385CA9.A66C2A9E-ON8525731E.00442E2F-8525731E.0044C56F@jmsmucker.com>
	<00B9BFA1C44A674794C9A1A4F5A22CA57B25F3@exchsrv07.rootdom.dk>
Message-ID: <20070725141302.GJ28521@redhat.com>

On Fri, Jul 20, 2007 at 02:36:17PM +0200, Kristoffer Lippert wrote:
> Hi,
>  
> A fence device is a device that can "build a fence" around a node, and thus keep it from corrupting a shared filesystem.
> Most commonly i think a powerswitch is used. It simply cuts the power to the decfunct server.
> It looks and works like this:
> http://www.wti.com/guides/rpb115ug.htm
> (but there are ofcourse loads of brands avaliabel - No, I don't work for WTI ;-)

Nor do I, but I'd go with an IPS 800 (or CE version), since someone I
know has one sitting around and it's well-tested :D

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:16:44 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:16:44 -0400
Subject: [Linux-cluster] [UPDATE] IP monitor failing periodically
In-Reply-To: <46A26F60.8020402@cmiware.com>
References: <46A26F60.8020402@cmiware.com>
Message-ID: <20070725141644.GK28521@redhat.com>

On Sat, Jul 21, 2007 at 03:41:04PM -0500, Chris Harms wrote:
> We reinstalled our machines with RHEL 5 x86_64 (we were running i386) a 
> few weeks ago and the mysterious IP monitoring failures have disappeared. 

> I believe it was postulated that a compiler bug regarding -fpie might be 
> causing segfaults in i386 binaries, so this would support that theory to 
> some degree, although I did not really attempt to confirm it further. 

This makes some sense, and I've heard of this problem.  It's a really
difficult theory to prove beyond doubt, but it makes a lot of sense.

Why else would pinging (effectively) localhost fail randomly?

Broken TCP/IP stack?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:17:35 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:17:35 -0400
Subject: [Linux-cluster] [UPDATE] IP monitor failing periodically
In-Reply-To: <46A5001F.80600@gmail.com>
References: <46A26F60.8020402@cmiware.com> <46A5001F.80600@gmail.com>
Message-ID: <20070725141735.GL28521@redhat.com>

On Mon, Jul 23, 2007 at 12:23:11PM -0700, Aravind Parchuri wrote:
> I'm not sure about the segfaults, but we are facing the same issues on 
> RHEL5 and FC6, i368 - random failovers due to ip-check failures. This 
> workaround seems to help, for now at least:
> 
> http://www.redhat.com/archives/linux-cluster/2006-March/msg00329.html
> 
> I'll check if it is indeed the ping segfaulting and report back when I 
> get some time.

easy way to do it w/o breaking anything:

call ulimit -c unlimited before ping line, and return success even if
ping fails ;)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:19:23 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:19:23 -0400
Subject: [Linux-cluster] Tools always return EXIT_FAILURE
In-Reply-To: <46A2E631.6020108@gmail.com>
References: <46A2E631.6020108@gmail.com>
Message-ID: <20070725141918.GM28521@redhat.com>

On Sat, Jul 21, 2007 at 10:08:01PM -0700, Aravind Parchuri wrote:
> I hope I'm not missing something obvious, but could someone tell me why 
> ccs_tool exits with "EXIT_FAILURE" even when it's completed the 
> requested operation successfully?

Could you file a bugzilla about this?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:22:11 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:22:11 -0400
Subject: [Linux-cluster] Hang on start fence_tool join with qdisk
In-Reply-To: <46A4AF26.20204@mts.com.ua>
References: <46A4AF26.20204@mts.com.ua>
Message-ID: <20070725142211.GN28521@redhat.com>

On Mon, Jul 23, 2007 at 04:37:42PM +0300, Eugene Melnichuk wrote:
> I have problem with my cluster running on RHEL5 + updates from  
> http://people.redhat.com/lhh/rhel5-test/  
> 
> I have 2 node cluster with shared quorum disk, qdiskd is running, but 
> when I start service cman I hang on Starting fencing.
> In my logs I have messages about regained qourum :
> 
> Jul 21 15:50:18 arf-web1 qdiskd[7326]: <info> Assuming master role
> Jul 21 15:50:19 arf-web1 ccsd[8188]: Cluster is not quorate.  Refusing 
> connection.
> Jul 21 15:50:19 arf-web1 ccsd[8188]: Error while processing connect: 
> Connection refused
> Jul 21 15:50:19 arf-web1 openais[8200]: [CMAN ] quorum regained, 
> resuming activity
> Jul 21 15:50:20 arf-web1 clurgmgrd[7746]: <notice> Quorum formed, starting
> Jul 21 15:50:20 arf-web1 kernel: dlm: no local IP address has been set
> Jul 21 15:50:20 arf-web1 kernel: dlm: cannot start dlm lowcomms -12

The cause here is probably the problem.  Does this happen without qdisk?
I don't understand why qdisk would cause this.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:24:05 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:24:05 -0400
Subject: [Linux-cluster] Command-line tools to add resources,
	services to cluster.conf
In-Reply-To: <46A55F7A.8020604@gmail.com>
References: <46A55F7A.8020604@gmail.com>
Message-ID: <20070725142405.GO28521@redhat.com>

On Mon, Jul 23, 2007 at 07:10:02PM -0700, Aravind Parchuri wrote:
> It looks like ccs_tool only does nodes and fence devices. Is there some 
> other tool to add/delete resources, failover domains and services to 
> cluster.conf from the command line?

> I guess it's simple enough to generate the xml, but it would be better 
> if there were some standard way of doing it. Thanks in advance.

There isn't currently, but it wouldn't be difficult to implement one
(even with error checking).  Generally, Conga does the trick :) (even
though it's web based).

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:27:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:27:51 -0400
Subject: [Linux-cluster] Timed out waiting for a response from Resource
	Group Manager
In-Reply-To: <1185283151.10165.6.camel@f2821966>
References: <1185283151.10165.6.camel@f2821966>
Message-ID: <20070725142751.GP28521@redhat.com>

On Tue, Jul 24, 2007 at 03:19:11PM +0200, Jacques Beth wrote:
> Hi All
> 
> I've got a quorate 4 node cluster, with a quorum disk.
> 
> _If_ I have _No_ fail over domain, resources or services defined,
> executing clustat is _fast_, and gives the following:
> 
> [root at fnbgw02 ~]# clustat
> Member Status: Quorate
> 
>   Member Name                        ID   Status
>   ------ ----                        ---- ------
>   fnbgw01.fnb.co.za                     1 Online
>   fnbgw02.fnb.co.za                     2 Online, Local
>   fnbgw03.fnb.co.za                     3 Online
>   fnbgw04.fnb.co.za                     4 Online
>   /dev/sdb1                             0 Online, Quorum Disk
> 
> 
> _If_ I define a failover group, a resource and a service, executing
> clustat is _slow_ and it fails to show the defined service:

That shouldn't matter (I'm not saying it doesn't matter, but that it
shouldn't).  It's especially weird that it looks like you don't even
have rgmanager running... (so there's no service output).

> 
> [root at fnbgw02 ~]# clustat
> Timed out waiting for a response from Resource Group Manager
> Member Status: Quorate
> 
>   Member Name                        ID   Status
>   ------ ----                        ---- ------
>   fnbgw01.fnb.co.za                     1 Online, Local
>   fnbgw02.fnb.co.za                     2 Online
>   fnbgw03.fnb.co.za                     3 Online
>   fnbgw04.fnb.co.za                     4 Online
>   /dev/sdb1                             0 Online, Quorum Disk
> 
> This is on CentOS5, with all the latest updates from CentOS applied.

Could you paste your cluster.conf somewhere?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jul 25 14:40:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:40:51 -0400
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <200707241621.45892.hlawatschek@atix.de>
References: <200707241621.45892.hlawatschek@atix.de>
Message-ID: <20070725144050.GQ28521@redhat.com>

On Tue, Jul 24, 2007 at 04:21:45PM +0200, Mark Hlawatschek wrote:
> Hi,
> 
> I'd like to use the "depend" feature of rgmanager on RHEL4. Is there a way to 
> do this. E.g. would it be possible to run the cvs HEAD version of rgmanager 
> on a RHEL4.5 cluster infrastructure ? 

It doesn't build on RHEL4 right now; there's no equivalent to
'cman_get_fenceinfo()'.

If you #if 0 out the 'node_fenced()' body and return 1, it builds.  I
don't know if it will run... but it builds ;)

It doesn't appear there's an equivalent function; so to implement it,
one would have to perhaps parse /proc/cluster/services or something like
that...

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From jacquesb at fnb.co.za  Wed Jul 25 14:43:53 2007
From: jacquesb at fnb.co.za (Jacques Botha)
Date: Wed, 25 Jul 2007 16:43:53 +0200
Subject: [Linux-cluster] Timed out waiting for a response from Resource
	Group Manager
In-Reply-To: <20070725142751.GP28521@redhat.com>
References: <1185283151.10165.6.camel@f2821966>
	<20070725142751.GP28521@redhat.com>
Message-ID: <1185374633.9421.10.camel@f2821966>

Here is the cluster.conf like you asked Lon.


<?xml version="1.0"?>
<cluster alias="fnbgw" config_version="12" name="fnbgw">
        <quorumd interval="2" label="fnbgw_qdisk" log_level="7"
min_score="3" status_file="/tmp/qdisk_status" tko="5" votes="4">
                <heuristic interval="10" program="ping 172.20.28.193 -c3
-t1" score="2"/>
                <heuristic interval="10" program="ping 172.20.28.195 -c3
-t1" score="1"/>
                <heuristic interval="10" program="ping 172.20.28.196 -c3
-t1" score="1"/>
                <heuristic interval="10" program="ping 172.20.28.197 -c3
-t1" score="1"/>
                <heuristic interval="10" program="ping 172.20.28.198 -c3
-t1" score="1"/>
        </quorumd>
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="fnbgw01.fnb.co.za" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="11"
name="BSM_BLADE_CENTRE_BLUE"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="fnbgw02.fnb.co.za" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="12"
name="BSM_BLADE_CENTRE_BLUE"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="fnbgw03.fnb.co.za" nodeid="3"
votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="11"
name="BSM_BLADE_CENTRE_RED3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="fnbgw04.fnb.co.za" nodeid="4"
votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="12"
name="BSM_BLADE_CENTRE_RED3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_bladecenter" ipaddr="1.1.1.1"
login="secretlogin" name="BSM_BLADE_CENTRE_BLUE"
passwd="verysecretpassword"/>
                <fencedevice agent="fence_bladecenter" ipaddr="1.1.1.1"
login="secretlogin" name="BSM_BLADE_CENTRE_RED"
passwd="verysecretpassword"/>
        </fencedevices>
        <rm>
                <resources>
                        <ip address="172.20.28.200" monitor_link="1"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="vmfail" ordered="0"
restricted="0">
                                <failoverdomainnode
name="fnbgw01.fnb.co.za" priority="1"/>
                                <failoverdomainnode
name="fnbgw02.fnb.co.za" priority="1"/>
                                <failoverdomainnode
name="fnbgw03.fnb.co.za" priority="1"/>
                                <failoverdomainnode
name="fnbgw04.fnb.co.za" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="vmfail" exclusive="0"
name="ip200" recovery="restart">
                        <ip ref="172.20.28.200"/>
                </service>
        </rm>
</cluster>



On Wed, 2007-07-25 at 10:27 -0400, Lon Hohberger wrote:
> On Tue, Jul 24, 2007 at 03:19:11PM +0200, Jacques Beth wrote:
> > Hi All
> > 
> > I've got a quorate 4 node cluster, with a quorum disk.
> > 
> > _If_ I have _No_ fail over domain, resources or services defined,
> > executing clustat is _fast_, and gives the following:
> > 
> > [root at fnbgw02 ~]# clustat
> > Member Status: Quorate
> > 
> >   Member Name                        ID   Status
> >   ------ ----                        ---- ------
> >   fnbgw01.fnb.co.za                     1 Online
> >   fnbgw02.fnb.co.za                     2 Online, Local
> >   fnbgw03.fnb.co.za                     3 Online
> >   fnbgw04.fnb.co.za                     4 Online
> >   /dev/sdb1                             0 Online, Quorum Disk
> > 
> > 
> > _If_ I define a failover group, a resource and a service, executing
> > clustat is _slow_ and it fails to show the defined service:
> 
> That shouldn't matter (I'm not saying it doesn't matter, but that it
> shouldn't).  It's especially weird that it looks like you don't even
> have rgmanager running... (so there's no service output).
> 
> > 
> > [root at fnbgw02 ~]# clustat
> > Timed out waiting for a response from Resource Group Manager
> > Member Status: Quorate
> > 
> >   Member Name                        ID   Status
> >   ------ ----                        ---- ------
> >   fnbgw01.fnb.co.za                     1 Online, Local
> >   fnbgw02.fnb.co.za                     2 Online
> >   fnbgw03.fnb.co.za                     3 Online
> >   fnbgw04.fnb.co.za                     4 Online
> >   /dev/sdb1                             0 Online, Quorum Disk
> > 
> > This is on CentOS5, with all the latest updates from CentOS applied.
> 
> Could you paste your cluster.conf somewhere?
> 

To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: 
https://www.fnb.co.za/disclaimer.html 

If you are unable to access the Disclaimer, send a blank e-mail to
firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070725/1cf7781b/attachment.sig>

From lhh at redhat.com  Wed Jul 25 14:52:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 10:52:19 -0400
Subject: [Linux-cluster] How does log option work in RHEL5 cluster
In-Reply-To: <6596a7c70707241213p7911d0f2j6c574db04bd60e70@mail.gmail.com>
References: <6596a7c70707241213p7911d0f2j6c574db04bd60e70@mail.gmail.com>
Message-ID: <20070725145219.GR28521@redhat.com>

On Tue, Jul 24, 2007 at 03:13:20PM -0400, siman hew wrote:
> I got confused with log option setting in cluster.conf.
> 
> I specified log_level 3 in tag rm, verify with ccs_test, but does not look
> like it works as expected.
> 
> here are commands I run and the results:
> 
> [root at node02 log]# ccs_test connect
> Connect successful.
> Connection descriptor = 29700
> 
> [root at node02 log]# ccs_test get 29700  //rm/@log_level
> Get successful.
> Value = <3>
> 
> [root at node02 log]# clulog -s 5 -p progName "severity 5 should not be shown"
> 
> [root at node02 log]# tail -1 /var/log/messages
> Jul 24 14:52:08 node02 clulog[22819]: <notice> severity 5 should not be 
> shown
> 
> [root at node02 log]# clulog -s 5 -l 5 -p progName "severity 5 should show"
> 
> [root at node02 log]# tail -1 /var/log/messages
> Jul 24 14:54:04 node02 clulog[23000]: <notice> severity 5 should show
> 
> [root at node02 log]# clulog -s 5 -l 4 -p progName "severity 5 should not
> show, filter is 4"
> 
> [root at node02 log]# tail -1 /var/log/messages
> Jul 24 14:54:04 node02 clulog[23000]: <notice> severity 5 should show
> 
> 
> My question is why severity 5 is shown in log file even if I did not specify
> filter in my command, clulog should uses the service manager's assigned log
> level that is 3.
> The cluster has 2 nodes, and all nodes are rebooted and restarted before run
> the commands.
> 
> Any explanation ?

Seems to work for me?  I have log_facility set, though. (though that
shouldn't matter).

Debug output didn't happen.

-- Lon

[root at red utils]# ccs_test connect
Connect successful.
 Connection descriptor = 5939640
[root at red utils]# ccs_test get 5939640 /cluster/rm/@log_facility
Get successful.
 Value = <local4>
[root at red utils]# ccs_test get 5939640 /cluster/rm/@log_level
Get successful.
 Value = <6>
[root at red utils]# grep rgmanager /etc/syslog.conf 
local4.* /var/log/rgmanager
[root at red utils]# cp /dev/null /var/log/rgmanager 
cp: overwrite `/var/log/rgmanager'? y
[root at red utils]# clulog -s 1 'alert_log_test'
[root at red utils]# clulog -s 5 'notice_log_test'
[root at red utils]# clulog -s 6 'info_log_test'
[root at red utils]# clulog -s 7 'debug_log_test'
[root at red utils]# cat /var/log/rgmanager 
Jul 25 10:45:42 red clulog[25281]: <alert> alert_log_test 
Jul 25 10:45:57 red clulog[25291]: <notice> notice_log_test 
Jul 25 10:46:03 red clulog[25296]: <info> info_log_test 
[root at red utils]# tail -3 /var/log/messages
Jul 25 10:45:42 red clulog[25281]: <alert> alert_log_test 
Jul 25 10:45:57 red clulog[25291]: <notice> notice_log_test 
Jul 25 10:46:03 red clulog[25296]: <info> info_log_test 


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From doc at mts.com.ua  Wed Jul 25 15:00:05 2007
From: doc at mts.com.ua (Eugene Melnichuk)
Date: Wed, 25 Jul 2007 18:00:05 +0300
Subject: [Linux-cluster] Hang on start fence_tool join with qdisk
In-Reply-To: <20070725142211.GN28521@redhat.com>
References: <46A4AF26.20204@mts.com.ua> <20070725142211.GN28521@redhat.com>
Message-ID: <46A76575.5080500@mts.com.ua>


Hi,

Without qdisk (with two_node=1) cluster works fine. But I need qdisk for 
latest transition from 2 -> 3 nodes without cluster restart.
Currently I built test cluster with the same hardware and reproduce this 
problem.
Messages from dlm occurred from time to time, often I have no messages 
after "Quorum formed, starting"
If I set clean_start=1, fencing start fine, but I still lock on access 
to cman_admin socket.

So, if you have any suggestions or new devel. pakages for testing , I 
can install it and gather debug information.
I can open official ticket for that, but since I installed unsupported 
packages this maybe wrong way :)
But, without your unofficial packages I still have non-working qdisk and 
ccs_tool update ...

PS I already tried to install new kernel from 
http://people.redhat.com/dzickus/el5/36.el5/x86_64/   (that contain many 
fixes in DLM) but without luck...


--
Eugene Melnichuk
Leading Engineer
email: doc at umc.ua <mailto:doc at umc.ua>
mob: +380503304043
pbx: +380501105731
CJSC Ukrainian Mobile Communications
49/2 Pobedy ave., room 4.26, 03680, Kyiv, Ukraine



Lon Hohberger ?????:
> On Mon, Jul 23, 2007 at 04:37:42PM +0300, Eugene Melnichuk wrote:
>   
>> I have problem with my cluster running on RHEL5 + updates from  
>> http://people.redhat.com/lhh/rhel5-test/  
>>
>> I have 2 node cluster with shared quorum disk, qdiskd is running, but 
>> when I start service cman I hang on Starting fencing.
>> In my logs I have messages about regained qourum :
>>
>> Jul 21 15:50:18 arf-web1 qdiskd[7326]: <info> Assuming master role
>> Jul 21 15:50:19 arf-web1 ccsd[8188]: Cluster is not quorate.  Refusing 
>> connection.
>> Jul 21 15:50:19 arf-web1 ccsd[8188]: Error while processing connect: 
>> Connection refused
>> Jul 21 15:50:19 arf-web1 openais[8200]: [CMAN ] quorum regained, 
>> resuming activity
>> Jul 21 15:50:20 arf-web1 clurgmgrd[7746]: <notice> Quorum formed, starting
>> Jul 21 15:50:20 arf-web1 kernel: dlm: no local IP address has been set
>> Jul 21 15:50:20 arf-web1 kernel: dlm: cannot start dlm lowcomms -12
>>     
>
> The cause here is probably the problem.  Does this happen without qdisk?
> I don't understand why qdisk would cause this.
>
>   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070725/55da229c/attachment.htm>

From jleafey at utmem.edu  Wed Jul 25 15:44:26 2007
From: jleafey at utmem.edu (Jay Leafey)
Date: Wed, 25 Jul 2007 10:44:26 -0500
Subject: [Linux-cluster] cluster software interoperability
In-Reply-To: <20070725135801.GA8834@redhat.com>
References: <46A66C34.5040802@cmiware.com> <20070725135801.GA8834@redhat.com>
Message-ID: <46A76FDA.9000907@utmem.edu>

David Teigland wrote:
> On Tue, Jul 24, 2007 at 04:16:36PM -0500, Chris Harms wrote:
>> My last hope for salvaging RHCS as a viable solution is to add a 3rd 
>> node to our previous 2-node setup.  Our 2 primary machines are on RHN 
>> and use the cluster packages from RedHat.  Will I be able to use the 
>> source files or fedora packages on a 3rd node running Fedora 7, or are 
>> there compatibility issues between versions?
> 
> Your best bet may be to build a rhel5 kernel for the third node from the
> src rpm, and then use the cluster source from the RHEL5 branch in cvs.
> 
> Dave
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

You might also try something RHEL-like, like CentOS 
(http://www.centos.org) or Scientific Linux 
(http://www.scientificlinux.org).  Since you're already used to RHEL, 
either choice will minimize the learning curve if any.

-- 
Jay Leafey - University of Tennessee
E-Mail:  jleafey at utmem.edu  Phone:  901-448-6534  FAX:  901-448-8199

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5158 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070725/004c859f/attachment.bin>

From simanhew at gmail.com  Wed Jul 25 16:59:24 2007
From: simanhew at gmail.com (siman hew)
Date: Wed, 25 Jul 2007 12:59:24 -0400
Subject: [Linux-cluster] How does log option work in RHEL5 cluster
In-Reply-To: <20070725145219.GR28521@redhat.com>
References: <6596a7c70707241213p7911d0f2j6c574db04bd60e70@mail.gmail.com>
	<20070725145219.GR28521@redhat.com>
Message-ID: <6596a7c70707250959n211293a1v8f432837809ca572@mail.gmail.com>

It "works", because you did not change the default log level.
How about change the log level, like me set it to 3, then run the exactly
same commands, see what it will happen.

Thanks,

Siman

On 7/25/07, Lon Hohberger <lhh at redhat.com> wrote:
>
> On Tue, Jul 24, 2007 at 03:13:20PM -0400, siman hew wrote:
> >
> Seems to work for me?  I have log_facility set, though. (though that
> shouldn't matter).
>
> Debug output didn't happen.
>
> -- Lon
>
> [root at red utils]# ccs_test connect
> Connect successful.
> Connection descriptor = 5939640
> [root at red utils]# ccs_test get 5939640 /cluster/rm/@log_facility
> Get successful.
> Value = <local4>
> [root at red utils]# ccs_test get 5939640 /cluster/rm/@log_level
> Get successful.
> Value = <6>
> [root at red utils]# grep rgmanager /etc/syslog.conf
> local4.* /var/log/rgmanager
> [root at red utils]# cp /dev/null /var/log/rgmanager
> cp: overwrite `/var/log/rgmanager'? y
> [root at red utils]# clulog -s 1 'alert_log_test'
> [root at red utils]# clulog -s 5 'notice_log_test'
> [root at red utils]# clulog -s 6 'info_log_test'
> [root at red utils]# clulog -s 7 'debug_log_test'
> [root at red utils]# cat /var/log/rgmanager
> Jul 25 10:45:42 red clulog[25281]: <alert> alert_log_test
> Jul 25 10:45:57 red clulog[25291]: <notice> notice_log_test
> Jul 25 10:46:03 red clulog[25296]: <info> info_log_test
> [root at red utils]# tail -3 /var/log/messages
> Jul 25 10:45:42 red clulog[25281]: <alert> alert_log_test
> Jul 25 10:45:57 red clulog[25291]: <notice> notice_log_test
> Jul 25 10:46:03 red clulog[25296]: <info> info_log_test
>
>
> --
> Lon Hohberger - Software Engineer - Red Hat, Inc.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070725/472a5df7/attachment.htm>

From lhh at redhat.com  Wed Jul 25 18:03:43 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 25 Jul 2007 14:03:43 -0400
Subject: [Linux-cluster] How does log option work in RHEL5 cluster
In-Reply-To: <6596a7c70707250959n211293a1v8f432837809ca572@mail.gmail.com>
References: <6596a7c70707241213p7911d0f2j6c574db04bd60e70@mail.gmail.com>
	<20070725145219.GR28521@redhat.com>
	<6596a7c70707250959n211293a1v8f432837809ca572@mail.gmail.com>
Message-ID: <20070725180343.GA9112@redhat.com>

On Wed, Jul 25, 2007 at 12:59:24PM -0400, siman hew wrote:
> It "works", because you did not change the default log level.
> How about change the log level, like me set it to 3, then run the exactly
> same commands, see what it will happen.

Yup, you're right.  This fixes it.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.
-------------- next part --------------
? clubufflush
? clufindhostname
? clulog
? clunfslock
? clustat
? clusvcadm
? foo.out
? foo.out.diff
? foo.out2
? foo.out3
? foo.out4
Index: clulog.c
===================================================================
RCS file: /cvs/cluster/cluster/rgmanager/src/utils/clulog.c,v
retrieving revision 1.1.2.3
diff -u -r1.1.2.3 clulog.c
--- clulog.c	3 May 2007 15:02:47 -0000	1.1.2.3
+++ clulog.c	25 Jul 2007 17:58:05 -0000
@@ -93,9 +93,6 @@
     if (argc < 4)
 	usage(argv[0]);
 
-    /* ../daemons/log.c */
-    configure_logging(-1);
-
     while ((opt = getopt(argc, argv, "f:l:s:hp:n:")) != -1) {
 	switch (opt) {
 	case 'l':
@@ -134,9 +131,10 @@
     if (!cmdline_loglevel) {
 	/*
 	 * Let's see what loglevel the SM is running at.
-	 * TODO Get rgmgr log level
+	 * If ccsd's not available, use default.
 	 */
-	clu_set_loglevel(LOGLEVEL_DFLT);
+    	if (configure_logging(-1) < 0)
+		clu_set_loglevel(LOGLEVEL_DFLT);
     }
     result = clulog_pid(severity, pid, progname, logmsg);
     free(progname);

From mcse47 at hotmail.com  Wed Jul 25 18:35:02 2007
From: mcse47 at hotmail.com (Tracey Flanders)
Date: Wed, 25 Jul 2007 14:35:02 -0400
Subject: [Linux-cluster] Does GNBD need to be clustered with the other
	servers?
Message-ID: <BAY123-F13D0153734166A333803EFD4F10@phx.gbl>

I have a question about a lab I am setting up. From what I have read and 
know about Red hat clustering I can't seem to find this answer. I have 3 
servers. One will provide storage access via GNBD for the other 2 servers 
using GFS. My question is do I add all three servers into a cluster config 
or do I leave the GNBD server out? I understand that the GNBD server will 
not be redundant. since it has the only server that has a phycical 
connection to the storage.  Thanks.

Tracey Flanders

_________________________________________________________________
http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507



From hlawatschek at atix.de  Wed Jul 25 18:50:17 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Wed, 25 Jul 2007 20:50:17 +0200
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <20070725144050.GQ28521@redhat.com>
References: <200707241621.45892.hlawatschek@atix.de>
	<20070725144050.GQ28521@redhat.com>
Message-ID: <200707252050.17898.hlawatschek@atix.de>

> > I'd like to use the "depend" feature of rgmanager on RHEL4. Is there a
> > way to do this. E.g. would it be possible to run the cvs HEAD version of
> > rgmanager on a RHEL4.5 cluster infrastructure ?
>
> It doesn't build on RHEL4 right now; there's no equivalent to
> 'cman_get_fenceinfo()'.
>
> If you #if 0 out the 'node_fenced()' body and return 1, it builds.  I
> don't know if it will run... but it builds ;)
>
> It doesn't appear there's an equivalent function; so to implement it,
> one would have to perhaps parse /proc/cluster/services or something like
> that...
What does the prototype of cman_get_fenceinfo() look like ? What is it meant 
to do ?

Is there another way to get the "depend" feature back-ported ?


-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/




From kanderso at redhat.com  Wed Jul 25 19:16:45 2007
From: kanderso at redhat.com (Kevin Anderson)
Date: Wed, 25 Jul 2007 14:16:45 -0500
Subject: [Linux-cluster] Does GNBD need to be clustered with the other
	servers?
In-Reply-To: <BAY123-F13D0153734166A333803EFD4F10@phx.gbl>
References: <BAY123-F13D0153734166A333803EFD4F10@phx.gbl>
Message-ID: <1185391005.2718.24.camel@dhcp80-204.msp.redhat.com>

On Wed, 2007-07-25 at 14:35 -0400, Tracey Flanders wrote:
> I have a question about a lab I am setting up. From what I have read and 
> know about Red hat clustering I can't seem to find this answer. I have 3 
> servers. One will provide storage access via GNBD for the other 2 servers 
> using GFS. My question is do I add all three servers into a cluster config 
> or do I leave the GNBD server out? I understand that the GNBD server will 
> not be redundant. since it has the only server that has a phycical 
> connection to the storage.  Thanks.
Leave the gnbd server out of the cluster config.  Your cluster is really
just the two nodes where gfs exists.  The gnbd server is your shared
storage device so not really part of the cluster.

Kevin

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070725/91e127f9/attachment.htm>

From bmarzins at redhat.com  Thu Jul 26 00:12:26 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Wed, 25 Jul 2007 19:12:26 -0500
Subject: [Linux-cluster] gnbd and caching
In-Reply-To: <200707241734.21198.Markus@hochholdinger.net>
References: <200707241734.21198.Markus@hochholdinger.net>
Message-ID: <20070726001226.GG24772@ether.msp.redhat.com>

On Tue, Jul 24, 2007 at 05:34:16PM +0200, Markus Hochholdinger wrote:
> Hi,
> 
> i'm running some standalone gnbd servers (option -c) and i'm wondering where 
> the (read-)caching of the gnbd happens and when/how the cache will be 
> emptied.
> 
> As far as i understand it, the caching happens on the gnbd client. Is this 
> correct?


No. On the gnbd client side, the device works just like every other block
device. When the device is closed, the cache is automatically flushed. 
 
> If so, when will the cache of the gnbd client be emptied? Will it be emptied 
> if no one accesses the device? Or will it be cached until the device is 
> reimported (removed and imported)?
> 
> Are there other possibilities to clear the read cache? Are there possibilities 
> to see how much data is cached (allocated memory of gnbd_recvd)? Or have i to 
> look for linux page cache?
> 

Like I said, the caching option has nothing to do with the client side. GNBD
acts just like any other block device.

The caching option does only one thing.  I determines whether or not the gnbd
server daemon opens the exported block device with the O_DIRECT flag or not.

If the device is uncached, then the gnbd server will access the exported device
with O_DIRECT, which bypasses the cache. Otherwise, the gnbd server will not use
O_DIRECT, and use the page cache. If the exported device was on a SAN, and
being used by exported by multiple gnbd servers (which would allow you to use
dm-multipath for failover), you would want it uncached, because otherwise
changes on other machines might be missed, if you pulled from cache.  Also, if
you were say, exporting part of an LVM volume, which you were also using
locally, you would want to be uncached. Otherwise would have one cache on top of
the exported device, and another cache (for the same device) on top of the LVM
volume. 

-Ben

> I'm using gnbd as network block devices for virtual hosts on other servers. 
> These virtual hosts can be moved between several hardware, but i want to be 
> sure no old (cached) data will be read by the virtual servers.
> 
> 
> -- 
> greetings
> 
> eMHa



> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From hal_bg at yahoo.com  Thu Jul 26 08:44:02 2007
From: hal_bg at yahoo.com (Hal)
Date: Thu, 26 Jul 2007 01:44:02 -0700 (PDT)
Subject: [Linux-cluster] Does GNBD need to be clustered with the other
	servers?
In-Reply-To: <1185391005.2718.24.camel@dhcp80-204.msp.redhat.com>
Message-ID: <974446.49852.qm@web32204.mail.mud.yahoo.com>


On Wed, 2007-07-25 at 14:35 -0400, Tracey Flanders wrote:
> I have a question about a lab I am setting up. From what I have read and 
> know about Red hat clustering I can't seem to find this answer. I have 3 
> servers. One will provide storage access via GNBD for the other 2 servers 
> using GFS. My question is do I add all three servers into a cluster config 
> or do I leave the GNBD server out? I understand that the GNBD server will 
> not be redundant. since it has the only server that has a phycical 
> connection to the storage.  Thanks.

If you intend to use gnbd fencing for the GFS nodes gnbd server should be in 
the cluster.


      ____________________________________________________________________________________
Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/ 



From Markus at hochholdinger.net  Thu Jul 26 13:32:20 2007
From: Markus at hochholdinger.net (Markus Hochholdinger)
Date: Thu, 26 Jul 2007 15:32:20 +0200
Subject: [Linux-cluster] gnbd and caching
In-Reply-To: <20070726001226.GG24772@ether.msp.redhat.com>
References: <200707241734.21198.Markus@hochholdinger.net>
	<20070726001226.GG24772@ether.msp.redhat.com>
Message-ID: <200707261532.25195.Markus@hochholdinger.net>

Hi,

Am Donnerstag, 26. Juli 2007 02:12 schrieb Benjamin Marzinski:
> On Tue, Jul 24, 2007 at 05:34:16PM +0200, Markus Hochholdinger wrote:
> > i'm running some standalone gnbd servers (option -c) and i'm wondering
> > where the (read-)caching of the gnbd happens and when/how the cache will
> > be emptied.
> > As far as i understand it, the caching happens on the gnbd client. Is
> > this correct?
> No. On the gnbd client side, the device works just like every other block
> device. When the device is closed, the cache is automatically flushed.

so nothing on the gnbd-client side inside gnbd makes any caching. This is 
good.


> > If so, when will the cache of the gnbd client be emptied? Will it be
> > emptied if no one accesses the device? Or will it be cached until the
> > device is reimported (removed and imported)?
> > Are there other possibilities to clear the read cache? Are there
> > possibilities to see how much data is cached (allocated memory of
> > gnbd_recvd)? Or have i to look for linux page cache?
> Like I said, the caching option has nothing to do with the client side.
> GNBD acts just like any other block device.
> The caching option does only one thing.  I determines whether or not the
> gnbd server daemon opens the exported block device with the O_DIRECT flag
> or not.

Ah, so this caching option is only relevant on the server. So if one gnbd 
client asks the server, it could serve data from the cache. If another 
gnbd-client asks the same server, it also could serve from the same cache. So 
as long as the exported block device will not change, my data would be 
konsistent.
And if i access the block device from the gnbd client, the normal linux 
caching mechanismen will take place.


> If the device is uncached, then the gnbd server will access the exported
> device with O_DIRECT, which bypasses the cache. Otherwise, the gnbd server
> will not use O_DIRECT, and use the page cache. If the exported device was
> on a SAN, and being used by exported by multiple gnbd servers (which would
> allow you to use dm-multipath for failover), you would want it uncached,
> because otherwise changes on other machines might be missed, if you pulled
> from cache.  Also, if you were say, exporting part of an LVM volume, which
> you were also using locally, you would want to be uncached. Otherwise would
> have one cache on top of the exported device, and another cache (for the
> same device) on top of the LVM volume.

I think i understand it now. This gnbd caching thing is all about block device 
caching on the gnbd server, not on the gnbd client.

If i access one exported block device from multiple gnbd clients with O_DIRECT 
my data will be konsistent.
This means, if one gnbd client reads (O_DIRECT) data from a cached gnbd 
server, this data will be cached on the gnbd server. Then another gnbd client 
writes data on the same position. If now the first gnbd client reads again 
the same position, it will get the data written from the second client, 
because the gnbd server serves both clients from the same cache.
Are am I correct?

Many thanks for your explanations.


-- 
greetings

eMHa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/e03970f7/attachment.sig>

From Dan.Askew at jmsmucker.com  Thu Jul 26 13:54:44 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Thu, 26 Jul 2007 09:54:44 -0400
Subject: [Linux-cluster] problem with lvm cluster
In-Reply-To: <506B469CC6211B49BE28F6AC56BBCA35013BF882@STJH-P102.PSNL.CA>
Message-ID: <OF54DDAFB2.0D8A145D-ON85257324.004C1AA3-85257324.004CE48C@jmsmucker.com>

II am having a problem with a volume group. I cannot activate it, change 
it remove or view it.

I have lvm 2.2.02.21 and have compile it with --with-clvmd=cman 
--with-cluster=shared

I then set the cluter bit on the volume group

I cannot do any command against this volume group now...

Any ideas??

RHEL4 Update 4
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/f32713c9/attachment.htm>

From ram.r.rajesh at gmail.com  Thu Jul 26 15:45:41 2007
From: ram.r.rajesh at gmail.com (ram kumar)
Date: Thu, 26 Jul 2007 21:15:41 +0530
Subject: [Linux-cluster] GFS installation
Message-ID: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>

Hi,

can anyone guide me to install GFS in an unix machine.

I am very new even to unix machine..

Kindly let me know the exact steps that i need to follow from the begining.

I used that usage.txt file from red-hat website.

But i getting an error while using make command.
I dont hve GNU package in my machine..

Kindly let me know whether the procedure i follow is correct,please let me
know if any prerquistes have to be done.

Regards,
Ram
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/34f8b01b/attachment.htm>

From hal_bg at yahoo.com  Thu Jul 26 16:28:43 2007
From: hal_bg at yahoo.com (Hal)
Date: Thu, 26 Jul 2007 09:28:43 -0700 (PDT)
Subject: [Linux-cluster] GFS installation
In-Reply-To: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>
Message-ID: <314252.52890.qm@web32209.mail.mud.yahoo.com>

What kind of Unix?
The Redhat cluster is for linux only.

Hal

--- ram kumar <ram.r.rajesh at gmail.com> wrote:

> Hi,
> 
> can anyone guide me to install GFS in an unix machine.
> 
> I am very new even to unix machine..
> 
> Kindly let me know the exact steps that i need to follow from the begining.
> 
> I used that usage.txt file from red-hat website.
> 
> But i getting an error while using make command.
> I dont hve GNU package in my machine..
> 
> Kindly let me know whether the procedure i follow is correct,please let me
> know if any prerquistes have to be done.
> 
> Regards,
> Ram
> > --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



       
____________________________________________________________________________________
Be a better Heartthrob. Get better relationship answers from someone who knows. Yahoo! Answers - Check it out. 
http://answers.yahoo.com/dir/?link=list&sid=396545433



From bmarzins at redhat.com  Thu Jul 26 18:10:29 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 26 Jul 2007 13:10:29 -0500
Subject: [Linux-cluster] Does GNBD need to be clustered with the other
	servers?
In-Reply-To: <974446.49852.qm@web32204.mail.mud.yahoo.com>
References: <1185391005.2718.24.camel@dhcp80-204.msp.redhat.com>
	<974446.49852.qm@web32204.mail.mud.yahoo.com>
Message-ID: <20070726181029.GJ24772@ether.msp.redhat.com>

On Thu, Jul 26, 2007 at 01:44:02AM -0700, Hal wrote:
> 
> On Wed, 2007-07-25 at 14:35 -0400, Tracey Flanders wrote:
> > I have a question about a lab I am setting up. From what I have read and 
> > know about Red hat clustering I can't seem to find this answer. I have 3 
> > servers. One will provide storage access via GNBD for the other 2 servers 
> > using GFS. My question is do I add all three servers into a cluster config 
> > or do I leave the GNBD server out? I understand that the GNBD server will 
> > not be redundant. since it has the only server that has a phycical 
> > connection to the storage.  Thanks.
> 
> If you intend to use gnbd fencing for the GFS nodes gnbd server should be in 
> the cluster.

The gnbd server node needs to be a cluster member for fence_gnbd to work.

-Ben

> 
> 
>       ____________________________________________________________________________________
> Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
> http://autos.yahoo.com/green_center/ 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From bmarzins at redhat.com  Thu Jul 26 18:24:50 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 26 Jul 2007 13:24:50 -0500
Subject: [Linux-cluster] GFS locks wnen a node fails
In-Reply-To: <224214.96819.qm@web32204.mail.mud.yahoo.com>
References: <224214.96819.qm@web32204.mail.mud.yahoo.com>
Message-ID: <20070726182450.GK24772@ether.msp.redhat.com>

On Fri, Jul 20, 2007 at 01:39:13AM -0700, Hal wrote:
> Hallo everybody,
> I have a test cluster of 4 machines. node0 - gnbd server and gnbd fence server
> and 3 nodes to mount gfs. The problem is that when I unplug one of the nodes,
> gfs locks and no one can access it until the node is reconnected. 
> 
> How can this lock be avoided if one node fails? 
> How can I tell that gnbd-fencing is working at all?
> 
> "gnbd_import -c node0" says nothing even if I do "fence_node node2" I assume 
> fencing is not working am I right?

To check fencing, you can to a
# gnbd_import -s <node_to_fence> -t <server_to_fence_from>
# gnbd_import -c <server_you_fenced_from>

This will show the banned list on the server. The fenced node should be there.
The node will remain on the panned list until it is manually unfenced, or
until the gnbd module is reloaded, and it reimports devices from the server.
 
> Regards 
> Hal
> 
> 
>        
> ____________________________________________________________________________________Ready for the edge of your seat? 
> Check out tonight's top picks on Yahoo! TV. 
> http://tv.yahoo.com/
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From bmarzins at redhat.com  Thu Jul 26 18:30:21 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 26 Jul 2007 13:30:21 -0500
Subject: [Linux-cluster] gnbd and caching
In-Reply-To: <200707261532.25195.Markus@hochholdinger.net>
References: <200707241734.21198.Markus@hochholdinger.net>
	<20070726001226.GG24772@ether.msp.redhat.com>
	<200707261532.25195.Markus@hochholdinger.net>
Message-ID: <20070726183021.GL24772@ether.msp.redhat.com>

On Thu, Jul 26, 2007 at 03:32:20PM +0200, Markus Hochholdinger wrote:
> Hi,
> 
> Am Donnerstag, 26. Juli 2007 02:12 schrieb Benjamin Marzinski:
> > On Tue, Jul 24, 2007 at 05:34:16PM +0200, Markus Hochholdinger wrote:
> > > i'm running some standalone gnbd servers (option -c) and i'm wondering
> > > where the (read-)caching of the gnbd happens and when/how the cache will
> > > be emptied.
> > > As far as i understand it, the caching happens on the gnbd client. Is
> > > this correct?
> > No. On the gnbd client side, the device works just like every other block
> > device. When the device is closed, the cache is automatically flushed.
> 
> so nothing on the gnbd-client side inside gnbd makes any caching. This is 
> good.
> 
> 
> > > If so, when will the cache of the gnbd client be emptied? Will it be
> > > emptied if no one accesses the device? Or will it be cached until the
> > > device is reimported (removed and imported)?
> > > Are there other possibilities to clear the read cache? Are there
> > > possibilities to see how much data is cached (allocated memory of
> > > gnbd_recvd)? Or have i to look for linux page cache?
> > Like I said, the caching option has nothing to do with the client side.
> > GNBD acts just like any other block device.
> > The caching option does only one thing.  I determines whether or not the
> > gnbd server daemon opens the exported block device with the O_DIRECT flag
> > or not.
> 
> Ah, so this caching option is only relevant on the server. So if one gnbd 
> client asks the server, it could serve data from the cache. If another 
> gnbd-client asks the same server, it also could serve from the same cache. So 
> as long as the exported block device will not change, my data would be 
> konsistent.
> And if i access the block device from the gnbd client, the normal linux 
> caching mechanismen will take place.
> 
> 
> > If the device is uncached, then the gnbd server will access the exported
> > device with O_DIRECT, which bypasses the cache. Otherwise, the gnbd server
> > will not use O_DIRECT, and use the page cache. If the exported device was
> > on a SAN, and being used by exported by multiple gnbd servers (which would
> > allow you to use dm-multipath for failover), you would want it uncached,
> > because otherwise changes on other machines might be missed, if you pulled
> > from cache.  Also, if you were say, exporting part of an LVM volume, which
> > you were also using locally, you would want to be uncached. Otherwise would
> > have one cache on top of the exported device, and another cache (for the
> > same device) on top of the LVM volume.
> 
> I think i understand it now. This gnbd caching thing is all about block device 
> caching on the gnbd server, not on the gnbd client.
> 
> If i access one exported block device from multiple gnbd clients with O_DIRECT 
> my data will be konsistent.
> This means, if one gnbd client reads (O_DIRECT) data from a cached gnbd 
> server, this data will be cached on the gnbd server. Then another gnbd client 
> writes data on the same position. If now the first gnbd client reads again 
> the same position, it will get the data written from the second client, 
> because the gnbd server serves both clients from the same cache.
> Are am I correct?

Yes. All the clients use the same cache on the server, so if they open the gnbd device O_DIRECT, they will all see eachother's writes.
 
> Many thanks for your explanations.
> 
> 
> -- 
> greetings
> 
> eMHa



> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From hal_bg at yahoo.com  Thu Jul 26 18:35:46 2007
From: hal_bg at yahoo.com (Hal)
Date: Thu, 26 Jul 2007 11:35:46 -0700 (PDT)
Subject: [Linux-cluster] GFS locks wnen a node fails
In-Reply-To: <20070726182450.GK24772@ether.msp.redhat.com>
Message-ID: <744077.3930.qm@web32209.mail.mud.yahoo.com>

I have fixed the problem a week ago :)
The problem was that GNBD server was not a cluster node.
10x anyway!

--- Benjamin Marzinski <bmarzins at redhat.com> wrote:

> On Fri, Jul 20, 2007 at 01:39:13AM -0700, Hal wrote:
> > Hallo everybody,
> > I have a test cluster of 4 machines. node0 - gnbd server and gnbd fence
> server
> > and 3 nodes to mount gfs. The problem is that when I unplug one of the
> nodes,
> > gfs locks and no one can access it until the node is reconnected. 
> > 
> > How can this lock be avoided if one node fails? 
> > How can I tell that gnbd-fencing is working at all?
> > 
> > "gnbd_import -c node0" says nothing even if I do "fence_node node2" I
> assume 
> > fencing is not working am I right?
> 
> To check fencing, you can to a
> # gnbd_import -s <node_to_fence> -t <server_to_fence_from>
> # gnbd_import -c <server_you_fenced_from>
> 
> This will show the banned list on the server. The fenced node should be
> there.
> The node will remain on the panned list until it is manually unfenced, or
> until the gnbd module is reloaded, and it reimports devices from the server.
>  
> > Regards 
> > Hal
> > 
> > 
> >        
> >
>
____________________________________________________________________________________Ready
> for the edge of your seat? 
> > Check out tonight's top picks on Yahoo! TV. 
> > http://tv.yahoo.com/
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



       
____________________________________________________________________________________
Moody friends. Drama queens. Your life? Nope! - their life, your story. Play Sims Stories at Yahoo! Games.
http://sims.yahoo.com/  



From sys.mailing at gmail.com  Thu Jul 26 18:59:13 2007
From: sys.mailing at gmail.com (Bjorn Oglefjorn)
Date: Thu, 26 Jul 2007 14:59:13 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <AcekV7L+j71k2cJ0Q9iH7q5JvB8TBg==>
	<F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <926ab61b0707261159n50ef0a31i4bf9d76b0f6e9755@mail.gmail.com>

I'd love to know how to do this as well.  Anyone?
--BO

On 6/1/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
>
> 1) Node 1 (master) dies
>         -How do we enable "sticky" failover so that it does not then fail
> back to Node 1
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/76f5f882/attachment.htm>

From sys.mailing at gmail.com  Thu Jul 26 19:17:59 2007
From: sys.mailing at gmail.com (Bjorn Oglefjorn)
Date: Thu, 26 Jul 2007 15:17:59 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <926ab61b0707261159n50ef0a31i4bf9d76b0f6e9755@mail.gmail.com>
References: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<926ab61b0707261159n50ef0a31i4bf9d76b0f6e9755@mail.gmail.com>
Message-ID: <926ab61b0707261217q34ffff3bobbb068281e95b379@mail.gmail.com>

I found that a 'nofailback' option was added for the <failoverdomains>
section of the conf.  I can't find any reference to 'nofailback' in any RHCS
doc I can find.  I'm guessing it should look like this:

    <failoverdomain name="test_failover_domain" ordered="1" restricted="1"
nofailback="1">
        ...
    </failoverdomain>

Can someone confirm?  I will attempt to confirm this myself and will report
back when I know for sure.

--BO

On 7/26/07, Bjorn Oglefjorn <sys.mailing at gmail.com> wrote:
>
> I'd love to know how to do this as well.  Anyone?
> --BO
>
> On 6/1/07, Robert Gil <Robert.Gil at americanhm.com > wrote:
> >
> > 1) Node 1 (master) dies
> >         -How do we enable "sticky" failover so that it does not then
> > fail back to Node 1
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/7fc3b682/attachment.htm>

From jwilson at transolutions.net  Thu Jul 26 19:30:05 2007
From: jwilson at transolutions.net (James Wilson)
Date: Thu, 26 Jul 2007 14:30:05 -0500
Subject: [Linux-cluster] IP not failing over
Message-ID: <46A8F63D.8000603@transolutions.net>

Can someone point why my virtual ip is not failing over. Any help is 
appreciated.



<?xml version="1.0"?>
<cluster config_version="7" name="nas-cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="dolphins" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-fencing" 
nodename="dolphins"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="lions" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-gnbd-fencing" 
nodename="lions"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="patriots" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="nas-fencing" 
nodename="patriots"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_gnbd" name="nas-gnbd-fencing" 
servers="dolphins-storage-failover"/>
                <fencedevice agent="fence_manual" name="nas-fencing"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="dolphins-drbd1" 
ordered="1" restricted="0">
                                <failoverdomainnode name="dolphins" 
priority="1"/>
                                <failoverdomainnode name="patriots" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="dolphins-drbd2" 
ordered="1" restricted="0">
                                <failoverdomainnode name="dolphins" 
priority="2"/>
                                <failoverdomainnode name="patriots" 
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.5.4" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="dolphins-drbd1" 
name="dolphins-svc-drbd1" recovery="restart">
                        <ip ref="192.168.5.4"/>
                </service>
                <service autostart="1" domain="dolphins-drbd2" 
name="dolphins-svc-drbd2" recovery="restart">
                        <ip ref="192.168.5.4"/>
                </service>
        </rm>
        <fence_xvmd/>
        <fence_xvmd/>
</cluster>
       
       



From Robert.Gil at americanhm.com  Thu Jul 26 20:21:14 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Thu, 26 Jul 2007 16:21:14 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <926ab61b0707261217q34ffff3bobbb068281e95b379@mail.gmail.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA986@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I had figured this out a while back and forgot to post back to the list.
The steps are as follows
 
Create a child script to the virtual ip to check whether mysql is up.
Have mysql NOT start on boot.
 
With these two things done. When the vip fails over, it will not fail
back because mysql wont be up. 
 
Rob Gil
Linux Systems Administrator
American Home Mortgage

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bjorn Oglefjorn
Sent: Thursday, July 26, 2007 3:18 PM
To: linux clustering
Subject: Re: [Linux-cluster] MySQL Failover / Failback


I found that a 'nofailback' option was added for the <failoverdomains>
section of the conf.  I can't find any reference to 'nofailback' in any
RHCS doc I can find.  I'm guessing it should look like this: 

    <failoverdomain name="test_failover_domain" ordered="1"
restricted="1" nofailback="1">
        ...
    </failoverdomain>

Can someone confirm?  I will attempt to confirm this myself and will
report back when I know for sure. 

--BO


On 7/26/07, Bjorn Oglefjorn <sys.mailing at gmail.com> wrote: 

	I'd love to know how to do this as well.  Anyone?
	--BO 
	
	
	
	On 6/1/07, Robert Gil <Robert.Gil at americanhm.com
<mailto:Robert.Gil at americanhm.com> > wrote: 

		1) Node 1 (master) dies
		        -How do we enable "sticky" failover so that it
does not then fail back to Node 1

	



From Markus at hochholdinger.net  Thu Jul 26 20:23:26 2007
From: Markus at hochholdinger.net (Markus Hochholdinger)
Date: Thu, 26 Jul 2007 22:23:26 +0200
Subject: [Linux-cluster] gnbd and caching
In-Reply-To: <20070726183021.GL24772@ether.msp.redhat.com>
References: <200707241734.21198.Markus@hochholdinger.net>
	<200707261532.25195.Markus@hochholdinger.net>
	<20070726183021.GL24772@ether.msp.redhat.com>
Message-ID: <200707262223.31010.Markus@hochholdinger.net>

Hi,

Am Donnerstag, 26. Juli 2007 20:30 schrieb Benjamin Marzinski:
[..]
> Yes. All the clients use the same cache on the server, so if they open the
> gnbd device O_DIRECT, they will all see eachother's writes.

many thanks for your explanations. I had doubt that my setup is OK, but now 
i'm sure it's correct. Again, many thanks :-)


-- 
greetings

eMHa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/95fd656d/attachment.sig>

From sys.mailing at gmail.com  Thu Jul 26 20:32:38 2007
From: sys.mailing at gmail.com (Bjorn Oglefjorn)
Date: Thu, 26 Jul 2007 16:32:38 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA986@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <926ab61b0707261217q34ffff3bobbb068281e95b379@mail.gmail.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA986@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <926ab61b0707261332t158e2a00l18abd6dae8b8a5cf@mail.gmail.com>

Thanks Rob, but that's not quite what I'm looking for.  I have a middleware
script of my own which allows me to have both nodes with mysql running at
all times.  I need to ensure that when the MASTER fails, Cluster Suite will
failover to the SLAVE.  My script then performs some SQL ops to allow
writes, etc.  When the MASTER comes back from being fenced, I need to make
sure that Cluster Suite does not failback to the MASTER.  I'm hoping that
the 'nofailback' option exists and will work as I expect it to.

--BO

On 7/26/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
>
> I had figured this out a while back and forgot to post back to the list.
> The steps are as follows
>
> Create a child script to the virtual ip to check whether mysql is up.
> Have mysql NOT start on boot.
>
> With these two things done. When the vip fails over, it will not fail
> back because mysql wont be up.
>
> Rob Gil
> Linux Systems Administrator
> American Home Mortgage
>
> ________________________________
>
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bjorn Oglefjorn
> Sent: Thursday, July 26, 2007 3:18 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] MySQL Failover / Failback
>
>
> I found that a 'nofailback' option was added for the <failoverdomains>
> section of the conf.  I can't find any reference to 'nofailback' in any
> RHCS doc I can find.  I'm guessing it should look like this:
>
>     <failoverdomain name="test_failover_domain" ordered="1"
> restricted="1" nofailback="1">
>         ...
>     </failoverdomain>
>
> Can someone confirm?  I will attempt to confirm this myself and will
> report back when I know for sure.
>
> --BO
>
>
> On 7/26/07, Bjorn Oglefjorn <sys.mailing at gmail.com> wrote:
>
>         I'd love to know how to do this as well.  Anyone?
>         --BO
>
>
>
>         On 6/1/07, Robert Gil <Robert.Gil at americanhm.com
> <mailto:Robert.Gil at americanhm.com> > wrote:
>
>                 1) Node 1 (master) dies
>                         -How do we enable "sticky" failover so that it
> does not then fail back to Node 1
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/152b982d/attachment.htm>

From Robert.Gil at americanhm.com  Thu Jul 26 21:24:39 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Thu, 26 Jul 2007 17:24:39 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <926ab61b0707261332t158e2a00l18abd6dae8b8a5cf@mail.gmail.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA98B@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

Well, if you prevent mysql from starting on boot, your load balancer
should recognize that and not fail back to that server. We took mysql
itself out of the cluster since we also do replication. It was not
necessary to have it in there for our application.
 
Rob Gil
Linux Systems Administrator
American Home Mortgage
 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bjorn Oglefjorn
Sent: Thursday, July 26, 2007 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] MySQL Failover / Failback


Thanks Rob, but that's not quite what I'm looking for.  I have a
middleware script of my own which allows me to have both nodes with
mysql running at all times.  I need to ensure that when the MASTER
fails, Cluster Suite will failover to the SLAVE.  My script then
performs some SQL ops to allow writes, etc.  When the MASTER comes back
from being fenced, I need to make sure that Cluster Suite does not
failback to the MASTER.  I'm hoping that the 'nofailback' option exists
and will work as I expect it to. 

--BO


On 7/26/07, Robert Gil <Robert.Gil at americanhm.com> wrote: 

	I had figured this out a while back and forgot to post back to
the list.
	The steps are as follows
	
	Create a child script to the virtual ip to check whether mysql
is up.
	Have mysql NOT start on boot.
	
	With these two things done. When the vip fails over, it will not
fail 
	back because mysql wont be up.
	
	Rob Gil
	Linux Systems Administrator
	American Home Mortgage
	
	________________________________
	
	From: linux-cluster-bounces at redhat.com 
	[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bjorn
Oglefjorn
	Sent: Thursday, July 26, 2007 3:18 PM
	To: linux clustering
	Subject: Re: [Linux-cluster] MySQL Failover / Failback 
	
	
	I found that a 'nofailback' option was added for the
<failoverdomains>
	section of the conf.  I can't find any reference to 'nofailback'
in any
	RHCS doc I can find.  I'm guessing it should look like this: 
	
	    <failoverdomain name="test_failover_domain" ordered="1"
	restricted="1" nofailback="1">
	        ...
	    </failoverdomain>
	
	Can someone confirm?  I will attempt to confirm this myself and
will 
	report back when I know for sure.
	
	--BO
	
	
	On 7/26/07, Bjorn Oglefjorn <sys.mailing at gmail.com> wrote:
	
	        I'd love to know how to do this as well.  Anyone? 
	        --BO
	
	
	
	        On 6/1/07, Robert Gil <Robert.Gil at americanhm.com
	<mailto:Robert.Gil at americanhm.com > > wrote:
	
	                1) Node 1 (master) dies
	                        -How do we enable "sticky" failover so
that it
	does not then fail back to Node 1
	
	
	
	--
	Linux-cluster mailing list 
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster
	


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070726/0861537d/attachment.htm>

From aravind.parchuri at gmail.com  Thu Jul 26 23:52:51 2007
From: aravind.parchuri at gmail.com (Aravind Parchuri)
Date: Thu, 26 Jul 2007 16:52:51 -0700
Subject: [Linux-cluster] Tools always return EXIT_FAILURE
In-Reply-To: <20070725141918.GM28521@redhat.com>
References: <46A2E631.6020108@gmail.com> <20070725141918.GM28521@redhat.com>
Message-ID: <46A933D3.1020101@gmail.com>

lhh at redhat.com wrote:
> On Sat, Jul 21, 2007 at 10:08:01PM -0700, Aravind Parchuri wrote:
>> I hope I'm not missing something obvious, but could someone tell me why 
>> ccs_tool exits with "EXIT_FAILURE" even when it's completed the 
>> requested operation successfully?
> 
> Could you file a bugzilla about this?
> 
Done. I asked before filing just in case I missed something.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249781

Aravind.



From bernard.chew at muvee.com  Fri Jul 27 02:33:37 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Fri, 27 Jul 2007 10:33:37 +0800
Subject: [Linux-cluster] iscsi-target server need to be part of cluster?
Message-ID: <229C73600EB0E54DA818AB599482BCE951EAFB@shadowfax.sg.muvee.net>


Hi,

If we have 5 servers with 1 server acting as an iscsi-target for the remaining 4 servers, do we need to include the iscsi-target server as part of the cluster? The intention is to export the partition via iscsi from 1 server and use it as a GFS filesystem for the remaining 4 servers.

Thanks,
Bernard Chew
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070727/4eb47258/attachment.htm>

From jnt at cs.auckland.ac.nz  Fri Jul 27 03:24:36 2007
From: jnt at cs.auckland.ac.nz (James Tyson)
Date: Fri, 27 Jul 2007 15:24:36 +1200
Subject: [Linux-cluster] Debug fencing.
Message-ID: <90A852D6-E74A-4529-96CA-C713AE69C7A0@cs.auckland.ac.nz>


Hi all.

I'm wondering if there is a way to get decent logs about what fenced  
is trying to do?  My machines (centos5) seem to hang at "starting  
fencing" and the log files aren't showing anything of import.

Any pointers on getting more debugging would be great.  Everything  
seemed to work until I added a third machine and changed the  
multicast address.

My cluster.conf as follows:

<?xml version="1.0" ?>
<cluster config_version="13" name="XenHosts">
         <fence_daemon post_fail_delay="0" post_join_delay="10"/>
         <clusternodes>
                 <clusternode name="karma01.sfac.auckland.ac.nz"  
nodeid="1" votes="1">
                         <multicast addr="239.193.236.96"  
interface="eth0"/>
                         <fence>
                                 <method name="1">
                                         <device name="Faux_Fence"  
nodename="karma01.sfac.auckland.ac.nz"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="karma02.sfac.auckland.ac.nz"  
nodeid="2" votes="1">
                         <multicast addr="239.193.236.96"  
interface="eth0"/>
                         <fence>
                                 <method name="1">
                                         <device name="Faux_Fence"  
nodename="karma02.sfac.auckland.ac.nz"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="xmail01.sfac.auckland.ac.nz"  
nodeid="3" votes="1">
                         <multicast addr="239.193.236.96"  
interface="eth0"/>
                         <fence>
                                 <method name="1">
                                         <device name="Faux_Fence"  
nodename="xmail01.sfac.auckland.ac.nz"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman>
                 <multicast addr="239.193.236.96"/>
         </cman>
         <fencedevices>
                 <fencedevice agent="fence_manual" name="Faux_Fence"/>
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="XenVMs" ordered="0"  
restricted="1">
                                 <failoverdomainnode  
name="karma01.sfac.auckland.ac.nz" priority="1"/>
                                 <failoverdomainnode  
name="karma02.sfac.auckland.ac.nz" priority="1"/>
                         </failoverdomain>
                         <failoverdomain name="MailVMs" ordered="1"  
restricted="0">
                                 <failoverdomainnode  
name="xmail01.sfac.auckland.ac.nz" priority="1"/>
                                 <failoverdomainnode  
name="karma01.sfac.auckland.ac.nz" priority="2"/>
                                 <failoverdomainnode  
name="karma02.sfac.auckland.ac.nz" priority="2"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
                         <clusterfs device="/dev/Xen/Configs"  
force_unmount="0" fsid="18041" fstype="gfs" mountpoint="/var/xen"  
name="XenConfigs" options=""/>
                 </resources>
         </rm>
</cluster>


-- 
James Tyson
x87422
9:249:17:2:157:116:227:91:216:65:86:197:99:86:136:192




From lhh at redhat.com  Fri Jul 27 13:45:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 27 Jul 2007 09:45:19 -0400
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <200707252050.17898.hlawatschek@atix.de>
References: <200707241621.45892.hlawatschek@atix.de>
	<20070725144050.GQ28521@redhat.com>
	<200707252050.17898.hlawatschek@atix.de>
Message-ID: <20070727134519.GD9112@redhat.com>

On Wed, Jul 25, 2007 at 08:50:17PM +0200, Mark Hlawatschek wrote:
> > > I'd like to use the "depend" feature of rgmanager on RHEL4. Is there a
> > > way to do this. E.g. would it be possible to run the cvs HEAD version of
> > > rgmanager on a RHEL4.5 cluster infrastructure ?
> >
> > It doesn't build on RHEL4 right now; there's no equivalent to
> > 'cman_get_fenceinfo()'.
> >
> > If you #if 0 out the 'node_fenced()' body and return 1, it builds.  I
> > don't know if it will run... but it builds ;)
> >
> > It doesn't appear there's an equivalent function; so to implement it,
> > one would have to perhaps parse /proc/cluster/services or something like
> > that...
> What does the prototype of cman_get_fenceinfo() look like ? What is it meant 
> to do ?

It gets the time the node was last fenced from CMAN.  When a node
rejoins the cluster, this information is cleared.

So, what rgmanager does is effectively:
  nodeX died
  does nodeX have fencing configured?
    yes
      wait for node X to be fenced or rejoin the cluster
    no
      continue

> Is there another way to get the "depend" feature back-ported ?

I think there's already a feature request open about it, by you?

Implementation-wise, in RHEL5, there's an event queue for service state
changes, and a function which broadcasts starts/stops
(broadcast_event()).  The difference is that in RHEL5, you can multicast
to all rgmanagers w/ one call - whereas on RHEL4, you have to open a
connection to each and send the message individually.

So, basically, it involves porting the event queue code and implementing
broadcast_event().

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From jbrassow at redhat.com  Fri Jul 27 15:32:46 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Fri, 27 Jul 2007 10:32:46 -0500
Subject: [Linux-cluster] problem with lvm cluster
In-Reply-To: <OF54DDAFB2.0D8A145D-ON85257324.004C1AA3-85257324.004CE48C@jmsmucker.com>
References: <OF54DDAFB2.0D8A145D-ON85257324.004C1AA3-85257324.004CE48C@jmsmucker.com>
Message-ID: <DCBD96DC-822F-4EE8-BD02-56B8EA90C15E@redhat.com>


On Jul 26, 2007, at 8:54 AM, Dan.Askew at jmsmucker.com wrote:

>
> II am having a problem with a volume group. I cannot activate it,  
> change it remove or view it.
>
> I have lvm 2.2.02.21 and have compile it with --with-clvmd=cman -- 
> with-cluster=shared
>
> I then set the cluter bit on the volume group
>
> I cannot do any command against this volume group now...
>
> Any ideas??
>

You didn't say if you had started cman (and if it is quorate) or if  
you had started clvmd.  Have you?

  brassow


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070727/89c7ece3/attachment.htm>

From lhh at redhat.com  Fri Jul 27 17:09:42 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 27 Jul 2007 13:09:42 -0400
Subject: [Linux-cluster] IP not failing over
In-Reply-To: <46A8F63D.8000603@transolutions.net>
References: <46A8F63D.8000603@transolutions.net>
Message-ID: <20070727170942.GE9112@redhat.com>

On Thu, Jul 26, 2007 at 02:30:05PM -0500, James Wilson wrote:
> Can someone point why my virtual ip is not failing over. Any help is 
> appreciated.

>                <resources>
>                        <ip address="192.168.5.4" monitor_link="1"/>
>                </resources>
>                <service autostart="1" domain="dolphins-drbd1" 
> name="dolphins-svc-drbd1" recovery="restart">
>                        <ip ref="192.168.5.4"/>
>                </service>
>                <service autostart="1" domain="dolphins-drbd2" 
> name="dolphins-svc-drbd2" recovery="restart">
>                        <ip ref="192.168.5.4"/>
>                </service>

You can't reference an IP in multiple services; the second reference is
ignored.  See:

  rg_test test /etc/cluster/cluster.conf &> /tmp/foo.out
  less /tmp/foo.out

The services should be independent and able to coexist.  If they are not
cohabitable, you need to use restricted domains to prevent them from
coexisting.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Jul 27 17:11:11 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 27 Jul 2007 13:11:11 -0400
Subject: [Linux-cluster] iscsi-target server need to be part of cluster?
In-Reply-To: <229C73600EB0E54DA818AB599482BCE951EAFB@shadowfax.sg.muvee.net>
References: <229C73600EB0E54DA818AB599482BCE951EAFB@shadowfax.sg.muvee.net>
Message-ID: <20070727171111.GF9112@redhat.com>

On Fri, Jul 27, 2007 at 10:33:37AM +0800, Bernard Chew wrote:
> 
> Hi,
> 
> If we have 5 servers with 1 server acting as an iscsi-target for the remaining 4 servers, do we need to include the iscsi-target server as part of the cluster? The intention is to export the partition via iscsi from 1 server and use it as a GFS filesystem for the remaining 4 servers.

No, you don't need machines acting solely as storage targets to be
part of the cluster.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From jwilson at transolutions.net  Fri Jul 27 18:09:50 2007
From: jwilson at transolutions.net (James Wilson)
Date: Fri, 27 Jul 2007 13:09:50 -0500
Subject: [Linux-cluster] IP not failing over
In-Reply-To: <20070727170942.GE9112@redhat.com>
References: <46A8F63D.8000603@transolutions.net>
	<20070727170942.GE9112@redhat.com>
Message-ID: <46AA34EE.4000509@transolutions.net>

I removed the second entry for the IP resource but the IP still doesn't 
failover to the second node. Here is the output from rg_test



Lon Hohberger wrote:Running in test mode.
Loaded 10 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 192.168.5.4 [ primary unique ]
  monitor_link = 1
  nfslock [ inherit("service%nfslock") ]

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = dolphins-svc-drbd1 [ primary unique required ]
  domain = dolphins-drbd1
  autostart = 1
  recovery = relocate

=== Resource Tree ===
service {
  name = "dolphins-svc-drbd1";
  domain = "dolphins-drbd1";
  autostart = "1";
  recovery = "relocate";
  ip {
    address = "192.168.5.4";
    monitor_link = "1";
    nfslock = "(null)";
  }
}
=== Failover Domains ===
Failover domain: dolphins-drbd1
Flags: Ordered
  Node dolphins (priority 1)
  Node patriots (priority 2)

> On Thu, Jul 26, 2007 at 02:30:05PM -0500, James Wilson wrote:
>   
>> Can someone point why my virtual ip is not failing over. Any help is 
>> appreciated.
>>     
>
>   
>>                <resources>
>>                        <ip address="192.168.5.4" monitor_link="1"/>
>>                </resources>
>>                <service autostart="1" domain="dolphins-drbd1" 
>> name="dolphins-svc-drbd1" recovery="restart">
>>                        <ip ref="192.168.5.4"/>
>>                </service>
>>                <service autostart="1" domain="dolphins-drbd2" 
>> name="dolphins-svc-drbd2" recovery="restart">
>>                        <ip ref="192.168.5.4"/>
>>                </service>
>>     
>
> You can't reference an IP in multiple services; the second reference is
> ignored.  See:
>
>   rg_test test /etc/cluster/cluster.conf &> /tmp/foo.out
>   less /tmp/foo.out
>
> The services should be independent and able to coexist.  If they are not
> cohabitable, you need to use restricted domains to prevent them from
> coexisting.
>
> -- Lon
>
>   



From Dan.Askew at jmsmucker.com  Fri Jul 27 18:26:05 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Fri, 27 Jul 2007 14:26:05 -0400
Subject: [Linux-cluster] Qdisk configuration
In-Reply-To: <46AA34EE.4000509@transolutions.net>
Message-ID: <OFE8B0F0D8.6B4D59D2-ON85257325.0064DA64-85257325.0065BCBF@jmsmucker.com>

Greetings all,

I have a small question on using a quorum disk. Below is me config for the 
quorum disk

              <quorumd device="/dev/sdc1" interval="1" label="qdisk" 
min_score="3" tko="10" votes="3">
                        <heuristic interval="1" program="/bin/ping 
10.1.0.18 -c1 -t1" score="1"/>
              </quorumd>


The device is pressent and the label is correct

root at goober1:/root> e2label /dev/sdc1
qdisk

However when I issue the command "cman_tool nodes" the status does not 
look correct to me

is this correct???

root at goober1:/root> cman_tool nodes
Node  Votes Exp Sts  Name
   0    3    0   X   /dev/sdc1
   1    1    2   M   goober1.na.jmsmucker.com
   2    1    2   M   goober2.na.jmsmucker.com


thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070727/493acbf3/attachment.htm>

From hlawatschek at atix.de  Fri Jul 27 18:34:40 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 27 Jul 2007 20:34:40 +0200
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <20070727134519.GD9112@redhat.com>
References: <200707241621.45892.hlawatschek@atix.de>
	<200707252050.17898.hlawatschek@atix.de>
	<20070727134519.GD9112@redhat.com>
Message-ID: <200707272034.40310.hlawatschek@atix.de>

On Friday 27 July 2007 15:45:19 Lon Hohberger wrote:
> On Wed, Jul 25, 2007 at 08:50:17PM +0200, Mark Hlawatschek wrote:
> > > > I'd like to use the "depend" feature of rgmanager on RHEL4. Is there
> > > > a way to do this. E.g. would it be possible to run the cvs HEAD
> > > > version of rgmanager on a RHEL4.5 cluster infrastructure ?
> > >
> > > It doesn't build on RHEL4 right now; there's no equivalent to
> > > 'cman_get_fenceinfo()'.
> > >
> > > If you #if 0 out the 'node_fenced()' body and return 1, it builds.  I
> > > don't know if it will run... but it builds ;)
> > >
> > > It doesn't appear there's an equivalent function; so to implement it,
> > > one would have to perhaps parse /proc/cluster/services or something
> > > like that...
> >
> > What does the prototype of cman_get_fenceinfo() look like ? What is it
> > meant to do ?
>
> It gets the time the node was last fenced from CMAN.  When a node
> rejoins the cluster, this information is cleared.
>
> So, what rgmanager does is effectively:
>   nodeX died
>   does nodeX have fencing configured?
>     yes
>       wait for node X to be fenced or rejoin the cluster
>     no
>       continue
>
> > Is there another way to get the "depend" feature back-ported ?
>
> I think there's already a feature request open about it, by you?
Yes, there are two feature requests open that address service dependencies
247980 - RFE: (strong and weak) service dependencies
247772 - RFE: One service following another
>
> Implementation-wise, in RHEL5, there's an event queue for service state
> changes, and a function which broadcasts starts/stops
> (broadcast_event()).  The difference is that in RHEL5, you can multicast
> to all rgmanagers w/ one call - whereas on RHEL4, you have to open a
> connection to each and send the message individually.
>
> So, basically, it involves porting the event queue code and implementing
> broadcast_event().
OK, how much effort - in your opinion - would it be to port/implement these 
features ?
Is there a task scheduled to do these things ? 

Thanks,

Mark

-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
Phone: +49-89 452 3538-15
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From mwill at penguincomputing.com  Fri Jul 27 19:06:18 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Fri, 27 Jul 2007 12:06:18 -0700
Subject: [Linux-cluster] iscsi-target server need to be part of cluster?
In-Reply-To: <20070727171111.GF9112@redhat.com>
Message-ID: <433093DF7AD7444DA65EFAFE3987879C4793C8@orca.penguincomputing.com>

But it is questionable if you are achieving what you planned to achieve.


What was the original intent of using GFS and cluster-suite?
1. redundancy in storage / failover and loadbalancing of the file
servers?

Your iscsitarget is now the single point of failure in the storage,
which is identical to having used
only one single NFS server in the first place. 

2. aggragate bandwidth of multiple file servers sharing the same files? 
Aggragate bandwidth using several gfs nodes to access the same data on
the same iscsi target again is unlikely to be higher. After all you now
have the distributed locking overhead, and you still go through the same
iscsitarget pipe. 

Once you have the iscsitarget serve up nonlocal storage (i.e. SAN luns)
and have more than one iscsitarget server, then it becomes interesting
again, and then they would likely be part of the redhat cluster.
d
Michael 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Friday, July 27, 2007 10:11 AM
To: linux clustering
Subject: Re: [Linux-cluster] iscsi-target server need to be part of
cluster?

On Fri, Jul 27, 2007 at 10:33:37AM +0800, Bernard Chew wrote:
> 
> Hi,
> 
> If we have 5 servers with 1 server acting as an iscsi-target for the
remaining 4 servers, do we need to include the iscsi-target server as
part of the cluster? The intention is to export the partition via iscsi
from 1 server and use it as a GFS filesystem for the remaining 4
servers.

No, you don't need machines acting solely as storage targets to be part
of the cluster.

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From srigler at marathonoil.com  Fri Jul 27 19:21:37 2007
From: srigler at marathonoil.com (Steve Rigler)
Date: Fri, 27 Jul 2007 14:21:37 -0500
Subject: [Linux-cluster] Cluster Node Crash
Message-ID: <1185564097.23636.61.camel@houuc8>

Hello All,

We are running GFS on RHEL4U3 (x86_64).  One of our cluster nodes
crashes this afternoon.  We are able to capture some of the message from
netdump (pasted below) before fencing killed the node.

Any advice would be appreciated.

Thanks,
Steve


lock_dlm:  Assertion failed on line 357 of file /usr/src/build/714650-
x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 5441671088
HOME: error=-22 num=3,bcd97c3 lkf=9 flags=84
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lockd parport netconsole netdump lock_dlm(U)(U) dlm
md5 ipv6 battery ac tg3 floppy ext3 qla2300 scsi_transport_fc scsi_mod
Pid: 3221, comm: gfs_glockd Not tainted 2.6.9-34.ELsmp
RSP: 0018:00000101123c1dd8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 000001004df39d80 RCX: 0000000000000246
RDX: 000000000000a997 RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 00000000ffffffea R08: 0000000000000004 R09: 000001004df39d80
R10: 0000000000000000 R11: 00000000000000e4 R12: 00000100d4a9a974
R13: ffffff0010182000 R14: ffffffffa0264e60 R15: 00000100d4a9a948
FS:  0000002a95575b00(0000) GS:ffffffff804d7b00(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000062fafc CR3: 0000000000101000 CR4: 00000000000006e0
Stack: 000001001d546c80 ffffff0010182000        00000100d4a9a948
ffffffffa0268b96

Call Trace:<ffffffffa022f97c>{:gfs:gfs_lm_unlock+41}
<ffffffffa02263d9>{:gfs:gfs_glock_drop_th+290}
<ffffffffa0224b7c>{:gfs:run_queue+314}
<ffffffffa0224dd0>{:gfs:unlock_on_glock+37}
<ffffffffa0224ec6>{:gfs:gfs_reclaim_glock+234}
<ffffffffa021975a>{:gfs:gfs_glockd+61}
<ffffffff801333c8>{default_wake_function+0}
       <ffffffff801333c8>{default_wake_function+0}
<ffffffff80110e17>{child_rip+8}
       <ffffffff801f64af>{vgacon_cursor+0}
<ffffffffa021971d>{:gfs:gfs_glockd+0}
       <ffffffff80110e0f>{child_rip+0}

Code: 0f 0b 52 d1 26 a0 ff ff ff ff 65 01 48 c7 c7 57 d1 26 a0 31
RIP <ffffffffa0268819>{:lock_dlm:do_dlm_unlock+167} RSP
<00000101123c1dd8>



From srigler at marathonoil.com  Fri Jul 27 19:28:49 2007
From: srigler at marathonoil.com (Steve Rigler)
Date: Fri, 27 Jul 2007 14:28:49 -0500
Subject: [Linux-cluster] Cluster Node Crash
In-Reply-To: <1185564097.23636.61.camel@houuc8>
References: <1185564097.23636.61.camel@houuc8>
Message-ID: <1185564529.23636.65.camel@houuc8>

On Fri, 2007-07-27 at 14:21 -0500, Steve Rigler wrote:
> Hello All,
> 
> We are running GFS on RHEL4U3 (x86_64).  One of our cluster nodes
> crashes this afternoon.  We are able to capture some of the message from
> netdump (pasted below) before fencing killed the node.
> 
> Any advice would be appreciated.
> 
> Thanks,
> Steve
> 
> 

As a followup, this is past tense (the word "crashes" should have been
"crashed").  One of the other nodes panicked after the first one tried
to rejoin the cluster (this is a 3 node cluster).

The dump from that node had these messages near the beginning of its
crash:
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 0300000d sm_stop: SG still joined

Followed by this:

lock_dlm:  Assertion failed on line 428 of file /usr/src/build/714650-
x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 5442621324
STUL03E: num=1,2 err=-22 cur=-1 req=3 lkf=0

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs nfs lockd nfs_acl parport_pc lp parport
netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core dm_mirror dm_round_robin dm_multipath button battery ac
uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 30604, comm: umount Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02689e7>] <ffffffffa02689e7>{:lock_dlm:do_dlm_lock
+365}
RSP: 0018:000001002ab6dc38  EFLAGS: 00010216
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 000000000000996e RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010117945c80 R08: 0000000000000004 R09: 00000000ffffffea
R10: 0000000000000000 R11: 00000000000000e4 R12: 00000100dfd23400
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000003
FS:  0000002a95575b00(0000) GS:ffffffff804d7b00(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003f95fc60c0 CR3: 0000000000101000 CR4: 00000000000006e0
Process umount (pid: 30604, threadinfo 000001002ab6c000, task
00000101120da030)
Stack: 0000000000000003 0000000000000000 3120202020202020
2020202020202020
       3220202020202020 0000000000000018 0000010117945c80
0000000000000000
       0000000000000003 0000000000000000
Call Trace:<ffffffffa0268b2a>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa022f93f>{:gfs:gfs_lm_lock+50}
       <ffffffffa02269da>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa0224cdd>{:gfs:run_queue+667}
       <ffffffffa0225ccf>{:gfs:gfs_glock_nq+938}
<ffffffffa0225f11>{:gfs:gfs_glock_nq_init+20}
       <ffffffffa024629b>{:gfs:gfs_make_fs_ro+39}
<ffffffffa023e508>{:gfs:gfs_put_super+630}
       <ffffffff8017d0c9>{generic_shutdown_super+202}
<ffffffffa023c009>{:gfs:gfs_kill_sb+42}
       <ffffffff801ccb78>{dummy_inode_permission+0}
<ffffffff8017cfe6>{deactivate_super+95}
       <ffffffff80192537>{sys_umount+925} <ffffffff80180264>{sys_newstat
+17}
       <ffffffff80110c61>{error_exit+0} <ffffffff801101c6>{system_call
+126}



From jwilson at transolutions.net  Fri Jul 27 20:35:14 2007
From: jwilson at transolutions.net (James Wilson)
Date: Fri, 27 Jul 2007 15:35:14 -0500
Subject: [Linux-cluster] IP not failing over
In-Reply-To: <46AA34EE.4000509@transolutions.net>
References: <46A8F63D.8000603@transolutions.net>	<20070727170942.GE9112@redhat.com>
	<46AA34EE.4000509@transolutions.net>
Message-ID: <46AA5702.6010002@transolutions.net>

Would it be easier to create a script to bing up the virtual ip and have 
rgmanager start the script on failover?

James Wilson wrote:
> I removed the second entry for the IP resource but the IP still 
> doesn't failover to the second node. Here is the output from rg_test
>
>
>
> Lon Hohberger wrote:Running in test mode.
> Loaded 10 resource rules
> === Resources List ===
> Resource type: ip
> Instances: 1/1
> Agent: ip.sh
> Attributes:
>  address = 192.168.5.4 [ primary unique ]
>  monitor_link = 1
>  nfslock [ inherit("service%nfslock") ]
>
> Resource type: service [INLINE]
> Instances: 1/1
> Agent: service.sh
> Attributes:
>  name = dolphins-svc-drbd1 [ primary unique required ]
>  domain = dolphins-drbd1
>  autostart = 1
>  recovery = relocate
>
> === Resource Tree ===
> service {
>  name = "dolphins-svc-drbd1";
>  domain = "dolphins-drbd1";
>  autostart = "1";
>  recovery = "relocate";
>  ip {
>    address = "192.168.5.4";
>    monitor_link = "1";
>    nfslock = "(null)";
>  }
> }
> === Failover Domains ===
> Failover domain: dolphins-drbd1
> Flags: Ordered
>  Node dolphins (priority 1)
>  Node patriots (priority 2)
>
>> On Thu, Jul 26, 2007 at 02:30:05PM -0500, James Wilson wrote:
>>  
>>> Can someone point why my virtual ip is not failing over. Any help is 
>>> appreciated.
>>>     
>>
>>  
>>>                <resources>
>>>                        <ip address="192.168.5.4" monitor_link="1"/>
>>>                </resources>
>>>                <service autostart="1" domain="dolphins-drbd1" 
>>> name="dolphins-svc-drbd1" recovery="restart">
>>>                        <ip ref="192.168.5.4"/>
>>>                </service>
>>>                <service autostart="1" domain="dolphins-drbd2" 
>>> name="dolphins-svc-drbd2" recovery="restart">
>>>                        <ip ref="192.168.5.4"/>
>>>                </service>
>>>     
>>
>> You can't reference an IP in multiple services; the second reference is
>> ignored.  See:
>>
>>   rg_test test /etc/cluster/cluster.conf &> /tmp/foo.out
>>   less /tmp/foo.out
>>
>> The services should be independent and able to coexist.  If they are not
>> cohabitable, you need to use restricted domains to prevent them from
>> coexisting.
>>
>> -- Lon
>>
>>   
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From chiranthlk at yahoo.com  Sat Jul 28 07:15:57 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Sat, 28 Jul 2007 00:15:57 -0700 (PDT)
Subject: [Linux-cluster] RGMANAGER segmentation fault
Message-ID: <727995.29330.qm@web52301.mail.re2.yahoo.com>

Hi,

I checked the errata but was not able to find out any rgamanger related fixes. What is the bug id you are refering to?

best regards
chirantha

----- Original Message ----
From: Lon Hohberger <lhh at redhat.com>
To: linux-cluster at redhat.com
Sent: Wednesday, July 25, 2007 10:10:51 PM
Subject: Re: [Linux-cluster] RGMANAGER segmentation fault

On Tue, Jul 17, 2007 at 06:44:03PM -0700, chirantha pitigala wrote:
> Hi all,
> 
> We have received segmentation fault at rgmanager several times while restarting a service. (Not frequent). After that automatic restart of the server happened.
> My OS is RHEL 4 Update 4 (2.6.9-42.ELsmp)
> rgmanager-1.9.54-1
> All the cluster packages are RHEL Update4.
> I saw this bug has been fixed in RHEL Update3. 
> 
> Log is as followed.
> 
> Jul 16 11:09:50 UI1 clurgmgrd[3631]: <notice> Stopping service reportgenerator Jul 16 11:09:51 UI1 clurgmgrd: [3631]: <info> Executing /etc/init.d/reportgenerator stop Jul 16 11:09:52 UI1 snmptrapd[3478]: 2007-07-16 11:09:52 192.168.40.1
> [192.168.40.1]: SNMPv2-MIB::sysUpTime.0 = Timeticks: (283880832) 32 days,
> 20:33:28.32     SNMPv2-MIB::snmpTrapOID.0 = OID: UCD-SNMP-MIB::linux   
> SNMPv2-SMI::private.2021.2.1.101 = STRING: "ReportGenerator stopped on UI1"
> Jul 16 11:09:53 UI1 clurgmgrd[3631]: <notice> Service reportgenerator is stopped Jul 16 11:09:53 UI1 kernel: clurgmgrd[28708]: segfault at 0000000000000050 rip 000000000040399e rsp 0000000043245010 error 4 Jul 16 11:10:01 UI1 crond(pam_unix)[32385]: session opened for user root by (uid=0) Jul 16 11:10:01 UI1 crond(pam_unix)[32387]: session opened for user ccs by

There's a couple of fixed crash-bugs in 4.5; you need to update
rgmanager, ccs, magma-plugins, and magma.  See the various errata for
4.5 to see what's specifically fixed.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster







       
____________________________________________________________________________________
Get the free Yahoo! toolbar and rest assured with the added security of spyware protection.
http://new.toolbar.yahoo.com/toolbar/features/norton/index.php
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/8dddbdcb/attachment.htm>

From ram.r.rajesh at gmail.com  Sat Jul 28 08:34:42 2007
From: ram.r.rajesh at gmail.com (ram kumar)
Date: Sat, 28 Jul 2007 14:04:42 +0530
Subject: [Linux-cluster] GFS installation
In-Reply-To: <314252.52890.qm@web32209.mail.mud.yahoo.com>
References: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>
	<314252.52890.qm@web32209.mail.mud.yahoo.com>
Message-ID: <1310407c0707280134k4bbd0095k4f89cf36862c6a35@mail.gmail.com>

Its an hp-ux machine.

Will it be possible with this machine

Regards,
Ram


On 7/26/07, Hal <hal_bg at yahoo.com> wrote:
>
> What kind of Unix?
> The Redhat cluster is for linux only.
>
> Hal
>
> --- ram kumar <ram.r.rajesh at gmail.com> wrote:
>
> > Hi,
> >
> > can anyone guide me to install GFS in an unix machine.
> >
> > I am very new even to unix machine..
> >
> > Kindly let me know the exact steps that i need to follow from the
> begining.
> >
> > I used that usage.txt file from red-hat website.
> >
> > But i getting an error while using make command.
> > I dont hve GNU package in my machine..
> >
> > Kindly let me know whether the procedure i follow is correct,please let
> me
> > know if any prerquistes have to be done.
> >
> > Regards,
> > Ram
> > > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
>
> ____________________________________________________________________________________
> Be a better Heartthrob. Get better relationship answers from someone who
> knows. Yahoo! Answers - Check it out.
> http://answers.yahoo.com/dir/?link=list&sid=396545433
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/34eaeda0/attachment.htm>

From rainer at ultra-secure.de  Sat Jul 28 09:32:16 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Sat, 28 Jul 2007 11:32:16 +0200
Subject: [Linux-cluster] GFS installation
In-Reply-To: <1310407c0707280134k4bbd0095k4f89cf36862c6a35@mail.gmail.com>
References: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>
	<314252.52890.qm@web32209.mail.mud.yahoo.com>
	<1310407c0707280134k4bbd0095k4f89cf36862c6a35@mail.gmail.com>
Message-ID: <9DB960B6-32E4-4B4C-986B-808EE6800A11@ultra-secure.de>


Am 28.07.2007 um 10:34 schrieb ram kumar:

> Its an hp-ux machine.


;-)


>
> Will it be possible with this machine
>


No.
Veritas^wSymantec seems to sell something that is called "Veritas  
Storage Foundation Cluster Filesystem"
http://www.symantec.com/enterprise/products/overview.jsp? 
pcid=2245&pvid=209_1
Supports HPUX 11iv2.


cheers,
Rainer
-- 
Rainer Duffner
CISSP, LPI, MCSE
rainer at ultra-secure.de




From alkol6 at gmail.com  Sat Jul 28 10:51:31 2007
From: alkol6 at gmail.com (Senol Erdogan)
Date: Sat, 28 Jul 2007 13:51:31 +0300
Subject: [Linux-cluster] GFS installation
In-Reply-To: <9DB960B6-32E4-4B4C-986B-808EE6800A11@ultra-secure.de>
References: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>
	<314252.52890.qm@web32209.mail.mud.yahoo.com>
	<1310407c0707280134k4bbd0095k4f89cf36862c6a35@mail.gmail.com>
	<9DB960B6-32E4-4B4C-986B-808EE6800A11@ultra-secure.de>
Message-ID: <93bf230a0707280351t46e4f73cv470022761d970ada@mail.gmail.com>

look up this link.
ftp://sources.redhat.com/pub/cluster/releases/

br



2007/7/28, Rainer Duffner <rainer at ultra-secure.de>:
>
>
> Am 28.07.2007 um 10:34 schrieb ram kumar:
>
> > Its an hp-ux machine.
>
>
> ;-)
>
>
> >
> > Will it be possible with this machine
> >
>
>
> No.
> Veritas^wSymantec seems to sell something that is called "Veritas
> Storage Foundation Cluster Filesystem"
> http://www.symantec.com/enterprise/products/overview.jsp?
> pcid=2245&pvid=209_1
> Supports HPUX 11iv2.
>
>
> cheers,
> Rainer
> --
> Rainer Duffner
> CISSP, LPI, MCSE
> rainer at ultra-secure.de
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/765369eb/attachment.htm>

From habib at isl-bd.com  Sat Jul 28 11:13:44 2007
From: habib at isl-bd.com (habib)
Date: Sat, 28 Jul 2007 17:13:44 +0600
Subject: [Linux-cluster] sendmail cluster problem.
Message-ID: <WorldClient-F200707281713.AA13440009@isl-bd.com>

Hi,

we are facing a problem with redhat cluster in Sendmail service.we have 
two node cluster . where we using cluster suite.when we change the 
cluster service to one node to another nod. them the problem is occure 
in mail storage .we are useing IMAP. with web mail.we already check the 
mail storage, every mail is on the common storage. after change the mail 
cluster service one node to other node it will problem.

we check the user UID .where user UID is same in both server. it is 
working fine. which is different from on to another it make problem
.
Server DELL PE 2950 server 
Storage : Dell EMC AX150.


pls give me the solutions for this problem.

thanks 

habib



From ram.r.rajesh at gmail.com  Sat Jul 28 11:59:26 2007
From: ram.r.rajesh at gmail.com (ram kumar)
Date: Sat, 28 Jul 2007 17:29:26 +0530
Subject: [Linux-cluster] GFS installation
In-Reply-To: <9DB960B6-32E4-4B4C-986B-808EE6800A11@ultra-secure.de>
References: <1310407c0707260845l1ff315ddtf1214b12c0bb5386@mail.gmail.com>
	<314252.52890.qm@web32209.mail.mud.yahoo.com>
	<1310407c0707280134k4bbd0095k4f89cf36862c6a35@mail.gmail.com>
	<9DB960B6-32E4-4B4C-986B-808EE6800A11@ultra-secure.de>
Message-ID: <1310407c0707280459y1e39a520pad415d92ae25b86f@mail.gmail.com>

Thanks rainer.

Regards,
Ram



On 7/28/07, Rainer Duffner <rainer at ultra-secure.de> wrote:
>
>
> Am 28.07.2007 um 10:34 schrieb ram kumar:
>
> > Its an hp-ux machine.
>
>
> ;-)
>
>
> >
> > Will it be possible with this machine
> >
>
>
> No.
> Veritas^wSymantec seems to sell something that is called "Veritas
> Storage Foundation Cluster Filesystem"
> http://www.symantec.com/enterprise/products/overview.jsp?
> pcid=2245&pvid=209_1
> Supports HPUX 11iv2.
>
>
> cheers,
> Rainer
> --
> Rainer Duffner
> CISSP, LPI, MCSE
> rainer at ultra-secure.de
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/55cea867/attachment.htm>

From chris at cmiware.com  Sat Jul 28 19:38:56 2007
From: chris at cmiware.com (Chris Harms)
Date: Sat, 28 Jul 2007 14:38:56 -0500
Subject: [Linux-cluster] sendmail cluster problem.
In-Reply-To: <WorldClient-F200707281713.AA13440009@isl-bd.com>
References: <WorldClient-F200707281713.AA13440009@isl-bd.com>
Message-ID: <46AB9B50.9040302@cmiware.com>

This is very vague.  What IMAP server are you running?  What errors does 
Sendmail give when it switches and can't deliver?  Absent that 
information, a guess is that you will need to fail the IMAP service with 
Sendmail if they are running on the same machine.

habib wrote:
> Hi,
>
> we are facing a problem with redhat cluster in Sendmail service.we have 
> two node cluster . where we using cluster suite.when we change the 
> cluster service to one node to another nod. them the problem is occure 
> in mail storage .we are useing IMAP. with web mail.we already check the 
> mail storage, every mail is on the common storage. after change the mail 
> cluster service one node to other node it will problem.
>
> we check the user UID .where user UID is same in both server. it is 
> working fine. which is different from on to another it make problem
> .
> Server DELL PE 2950 server 
> Storage : Dell EMC AX150.
>
>
> pls give me the solutions for this problem.
>
> thanks 
>
> habib
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From mhanafi at csc.com  Sat Jul 28 20:03:09 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Sat, 28 Jul 2007 16:03:09 -0400
Subject: [Linux-cluster] GFS i/o Request size
Message-ID: <OFCE3CCC49.E35DCEFE-ON85257326.006E1D6C-85257326.006E2A29@csc.com>

It seem that GFS likes to do all I/O in 512KB sizes. Is there a way to 
increase this? I like to do I/O sizes of 1024 or 2048KB. 

thanks,
Mahmoud


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/8034b703/attachment.htm>

From mhanafi at csc.com  Sat Jul 28 20:04:03 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Sat, 28 Jul 2007 16:04:03 -0400
Subject: [Linux-cluster] lvremove error
Message-ID: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>

I am trying to remove a lv via cmd line but I get the following error.

[root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512
Do you really want to remove active logical volume "gfs_st2_sz512"? [y/n]: 
y
  Error locking on node nfs1.local: Volume is busy on another node
  Can't get exclusive access to volume "gfs_st2_sz512"

But I can remove lv via the GUI (luci). 

What am I doing wrong on the cmd line.



Thanks,
Mahmoud

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070728/49ecc894/attachment.htm>

From darkblue2000 at gmail.com  Sun Jul 29 08:42:11 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Sun, 29 Jul 2007 16:42:11 +0800
Subject: [Linux-cluster] =?utf-8?q?where_can_I_download_rhcs=2Eiso?=
	=?utf-8?b?77yf?=
Message-ID: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>

I have a rhn account, but I can't find the rhcs.iso for as4u5, anyone
know where is it?

-- 
He is nothing



From irwan at magnifix.com.my  Sun Jul 29 08:59:07 2007
From: irwan at magnifix.com.my (Mohd Irwan)
Date: Sun, 29 Jul 2007 16:59:07 +0800
Subject: [Linux-cluster] where can I download =?UTF-8?Q?rhcs=2Eiso?=
	=?UTF-8?Q?=EF=BC=9F?=
In-Reply-To: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
Message-ID: <1185699547.30014.9.camel@kuli.magnifix.com.my>

On Sun, 2007-07-29 at 16:42 +0800, darkblue wrote:
> I have a rhn account, but I can't find the rhcs.iso for as4u5, anyone
> know where is it?

Provided you have an active subscription for RHCS, it should be at 
RHN Website -> Channel -> RHEL AS (version 4) -> RHCS -> Download

-- 
Regards,
+--------------------------------+-------------------------------------+
|       Mohd Irwan Jamaluddin    |  "Being a Bayern Munich fan is like |
| ##    System Engineer,         |   a love affair. If you don't take  |
| (o_   Magnifix Sdn. Bhd.       |   it seriously, it is no fun; if    |
| //\   Tel: +60 3 42705073      |   you do take it seriously, it      |
| V_/_  Fax: +60 3 42701960      |   breaks your heart."               |
|       http://www.magnifix.com/ |                                     |
+--------------------------------+-------------------------------------+



From darkblue2000 at gmail.com  Sun Jul 29 09:16:55 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Sun, 29 Jul 2007 17:16:55 +0800
Subject: =?UTF-8?Q?Re:_[Linux-cluster]_where_can_I_download_rhcs.iso=EF=BC=9F?=
In-Reply-To: <1185699547.30014.9.camel@kuli.magnifix.com.my>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
	<1185699547.30014.9.camel@kuli.magnifix.com.my>
Message-ID: <2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>

no, I can't find it there, the iso list like this:

 Red Hat Enterprise Linux AS (v. 4 for 32-bit x86)
   Red Hat Application Server v. 2 (AS v. 4 for i386)  742  0
   Red Hat Application Server v. 2 Beta (AS v. 4 for i386)  0  0
   Red Hat Developer Suite v. 2.1 (AS v. 4 for i386)  15  0
   Red Hat Developer Suite v. 2.1 Beta (AS v. 4 for i386)  0  0
   Red Hat Developer Suite v. 3 (AS v. 4 for i386)  145  0
   Red Hat Network Tools for RHEL AS (v.4 for x86)  1252  0
   Red Hat Web Application Stack 1.0 Beta (for AS v. 4 x86)  56  0
   RHEL AS (v. 4 for x86) Beta  4  0
   RHEL AS (v. 4 for x86) Extras  124  0
   RHEL AS (v. 4 for x86) Extras Beta  0  0
   RHEL AS (v. 4 for x86) Fastrack  127  0
   RHEL AS (v. 4 for x86) SDK Beta
you can check it out with your rhn account.


2007/7/29, Mohd Irwan <irwan at magnifix.com.my>:
> On Sun, 2007-07-29 at 16:42 +0800, darkblue wrote:
> > I have a rhn account, but I can't find the rhcs.iso for as4u5, anyone
> > know where is it?
>
> Provided you have an active subscription for RHCS, it should be at
> RHN Website -> Channel -> RHEL AS (version 4) -> RHCS -> Download
>
> --
> Regards,
> +--------------------------------+-------------------------------------+
> |       Mohd Irwan Jamaluddin    |  "Being a Bayern Munich fan is like |
> | ##    System Engineer,         |   a love affair. If you don't take  |
> | (o_   Magnifix Sdn. Bhd.       |   it seriously, it is no fun; if    |
> | //\   Tel: +60 3 42705073      |   you do take it seriously, it      |
> | V_/_  Fax: +60 3 42701960      |   breaks your heart."               |
> |       http://www.magnifix.com/ |                                     |
> +--------------------------------+-------------------------------------+
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
He is nothing



From irwan at magnifix.com.my  Sun Jul 29 09:35:15 2007
From: irwan at magnifix.com.my (Mohd Irwan)
Date: Sun, 29 Jul 2007 17:35:15 +0800
Subject: [Linux-cluster] where can I download =?UTF-8?Q?rhcs=2Eiso?=
	=?UTF-8?Q?=EF=BC=9F?=
In-Reply-To: <2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
	<1185699547.30014.9.camel@kuli.magnifix.com.my>
	<2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>
Message-ID: <1185701715.30014.12.camel@kuli.magnifix.com.my>

On Sun, 2007-07-29 at 17:16 +0800, darkblue wrote:
> no, I can't find it there, the iso list like this:
> 
>  Red Hat Enterprise Linux AS (v. 4 for 32-bit x86)
>    Red Hat Application Server v. 2 (AS v. 4 for i386)  742  0
>    Red Hat Application Server v. 2 Beta (AS v. 4 for i386)  0  0
>    Red Hat Developer Suite v. 2.1 (AS v. 4 for i386)  15  0
>    Red Hat Developer Suite v. 2.1 Beta (AS v. 4 for i386)  0  0
>    Red Hat Developer Suite v. 3 (AS v. 4 for i386)  145  0
>    Red Hat Network Tools for RHEL AS (v.4 for x86)  1252  0
>    Red Hat Web Application Stack 1.0 Beta (for AS v. 4 x86)  56  0
>    RHEL AS (v. 4 for x86) Beta  4  0
>    RHEL AS (v. 4 for x86) Extras  124  0
>    RHEL AS (v. 4 for x86) Extras Beta  0  0
>    RHEL AS (v. 4 for x86) Fastrack  127  0
>    RHEL AS (v. 4 for x86) SDK Beta
> you can check it out with your rhn account.

It is because of you do not have an active subscription for RHCS.

-- 
Regards,
+--------------------------------+-------------------------------------+
|       Mohd Irwan Jamaluddin    |  "Being a Bayern Munich fan is like |
| ##    System Engineer,         |   a love affair. If you don't take  |
| (o_   Magnifix Sdn. Bhd.       |   it seriously, it is no fun; if    |
| //\   Tel: +60 3 42705073      |   you do take it seriously, it      |
| V_/_  Fax: +60 3 42701960      |   breaks your heart."               |
|       http://www.magnifix.com/ |                                     |
+--------------------------------+-------------------------------------+



From bernard.chew at muvee.com  Sun Jul 29 10:01:31 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Sun, 29 Jul 2007 18:01:31 +0800
Subject: [Linux-cluster] iscsi-target server need to be part of cluster ?
Message-ID: <015001c7d1c6$940354e1$1e59a8c0@sg.muvee.net>

Hi Micheal and Lon,

Thank you for the quick replies. The mentioned setup is in  a staging environment; the iscsi target server will be replaced by a iscsi san in actual deployment. I wanted to ensure the iscsi target server is deployed correctly for testing.

Regards,
Bernard Chew

--- original message ---
From: "Michael Will" <mwill at penguincomputing.com>
Subject: RE: [Linux-cluster] iscsi-target server need to be part of cluster?
Date: 28th July 2007
Time: 3:6:29 

But it is questionable if you are achieving what you planned to achieve.


What was the original intent of using GFS and cluster-suite?
1. redundancy in storage / failover and loadbalancing of the file
servers?

Your iscsitarget is now the single point of failure in the storage,
which is identical to having used
only one single NFS server in the first place. 

2. aggragate bandwidth of multiple file servers sharing the same files? 
Aggragate bandwidth using several gfs nodes to access the same data on
the same iscsi target again is unlikely to be higher. After all you now
have the distributed locking overhead, and you still go through the same
iscsitarget pipe. 

Once you have the iscsitarget serve up nonlocal storage (i.e. SAN luns)
and have more than one iscsitarget server, then it becomes interesting
again, and then they would likely be part of the redhat cluster.
d
Michael 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Friday, July 27, 2007 10:11 AM
To: linux clustering
Subject: Re: [Linux-cluster] iscsi-target server need to be part of
cluster?

On Fri, Jul 27, 2007 at 10:33:37AM +0800, Bernard Chew wrote:
> 
> Hi,
> 
> If we have 5 servers with 1 server acting as an iscsi-target for the
remaining 4 servers, do we need to include the iscsi-target server as
part of the cluster? The intention is to export the partition via iscsi
from 1 server and use it as a GFS filesystem for the remaining 4
servers.

No, you don't need machines acting solely as storage targets to be part
of the cluster.

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From darkblue2000 at gmail.com  Sun Jul 29 10:22:25 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Sun, 29 Jul 2007 18:22:25 +0800
Subject: =?UTF-8?Q?Re:_[Linux-cluster]_where_can_I_download_rhcs.iso=EF=BC=9F?=
In-Reply-To: <1185701715.30014.12.camel@kuli.magnifix.com.my>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
	<1185699547.30014.9.camel@kuli.magnifix.com.my>
	<2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>
	<1185701715.30014.12.camel@kuli.magnifix.com.my>
Message-ID: <2c8195ff0707290322i733590faq7dd2695b09c247dd@mail.gmail.com>

I have an 30-day unsupport evalution subscription, and already actived it.
but still can't access rhcs channel. any body know how to do it?

2007/7/29, Mohd Irwan <irwan at magnifix.com.my>:
> On Sun, 2007-07-29 at 17:16 +0800, darkblue wrote:
> > no, I can't find it there, the iso list like this:
> >
> >  Red Hat Enterprise Linux AS (v. 4 for 32-bit x86)
> >    Red Hat Application Server v. 2 (AS v. 4 for i386)  742  0
> >    Red Hat Application Server v. 2 Beta (AS v. 4 for i386)  0  0
> >    Red Hat Developer Suite v. 2.1 (AS v. 4 for i386)  15  0
> >    Red Hat Developer Suite v. 2.1 Beta (AS v. 4 for i386)  0  0
> >    Red Hat Developer Suite v. 3 (AS v. 4 for i386)  145  0
> >    Red Hat Network Tools for RHEL AS (v.4 for x86)  1252  0
> >    Red Hat Web Application Stack 1.0 Beta (for AS v. 4 x86)  56  0
> >    RHEL AS (v. 4 for x86) Beta  4  0
> >    RHEL AS (v. 4 for x86) Extras  124  0
> >    RHEL AS (v. 4 for x86) Extras Beta  0  0
> >    RHEL AS (v. 4 for x86) Fastrack  127  0
> >    RHEL AS (v. 4 for x86) SDK Beta
> > you can check it out with your rhn account.
>
> It is because of you do not have an active subscription for RHCS.
>
> --
> Regards,
> +--------------------------------+-------------------------------------+
> |       Mohd Irwan Jamaluddin    |  "Being a Bayern Munich fan is like |
> | ##    System Engineer,         |   a love affair. If you don't take  |
> | (o_   Magnifix Sdn. Bhd.       |   it seriously, it is no fun; if    |
> | //\   Tel: +60 3 42705073      |   you do take it seriously, it      |
> | V_/_  Fax: +60 3 42701960      |   breaks your heart."               |
> |       http://www.magnifix.com/ |                                     |
> +--------------------------------+-------------------------------------+
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
He is nothing



From rainer at ultra-secure.de  Sun Jul 29 14:47:23 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Sun, 29 Jul 2007 16:47:23 +0200
Subject: =?UTF-8?Q?Re:_[Linux-cluster]_where_can_I_download_rhcs.iso?=
	=?UTF-8?Q?=EF=BC=9F?=
In-Reply-To: <2c8195ff0707290322i733590faq7dd2695b09c247dd@mail.gmail.com>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
	<1185699547.30014.9.camel@kuli.magnifix.com.my>
	<2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>
	<1185701715.30014.12.camel@kuli.magnifix.com.my>
	<2c8195ff0707290322i733590faq7dd2695b09c247dd@mail.gmail.com>
Message-ID: <25B35513-630F-49D9-9B6A-372DF33080B2@ultra-secure.de>


Am 29.07.2007 um 12:22 schrieb darkblue:

> I have an 30-day unsupport evalution subscription, and already  
> actived it.
> but still can't access rhcs channel. any body know how to do it?
>


Phone their customer-support on monday morning and complain.
Or call your sales-rep.
Because: my guess is the people with @redhat addresses who post here  
don't have much influence on your account either ;-)



cheers,
Rainer
-- 
Rainer Duffner
CISSP, LPI, MCSE
rainer at ultra-secure.de




From darkblue2000 at gmail.com  Mon Jul 30 00:12:43 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Mon, 30 Jul 2007 08:12:43 +0800
Subject: =?UTF-8?Q?[Linux-cluster]_where_can_I_download_rhcs.iso=EF=BC=9F?=
In-Reply-To: <2c8195ff0707290239m91915e8i53a41f7d68533eeb@mail.gmail.com>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>
	<1185699547.30014.9.camel@kuli.magnifix.com.my>
	<2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>
	<1185701385.30014.11.camel@kuli.magnifix.com.my>
	<2c8195ff0707290239m91915e8i53a41f7d68533eeb@mail.gmail.com>
Message-ID: <2c8195ff0707291712q2af46f0bj89881743bdb7ac88@mail.gmail.com>

hmm, I am downloading the src.rpm packages from
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS
the packages are:
ccs-1.0.10-0.src.rpm
ccs-1.0.2-0.src.rpm
ccs-1.0.3-0.src.rpm
ccs-1.0.7-0.src.rpm
clustermon-0.9.1-8.src.rpm
cman-1.0.11-0.src.rpm
cman-1.0.17-0.src.rpm
cman-1.0.2-0.src.rpm
cman-1.0.4-0.src.rpm
cman-kernel-2.6.9-39.5.src.rpm
cman-kernel-2.6.9-39.8.src.rpm
cman-kernel-2.6.9-41.0.2.src.rpm
cman-kernel-2.6.9-41.0.src.rpm
cman-kernel-2.6.9-43.8.3.src.rpm
cman-kernel-2.6.9-43.8.5.src.rpm
cman-kernel-2.6.9-43.8.src.rpm
cman-kernel-2.6.9-45.14.src.rpm
cman-kernel-2.6.9-45.15.src.rpm
cman-kernel-2.6.9-45.2.src.rpm
cman-kernel-2.6.9-45.3.src.rpm
cman-kernel-2.6.9-45.4.src.rpm
cman-kernel-2.6.9-45.5.src.rpm
cman-kernel-2.6.9-45.8.src.rpm
cman-kernel-2.6.9-50.2.0.1.src.rpm
cman-kernel-2.6.9-50.2.src.rpm
conga-0.9.1-8.src.rpm
dlm-1.0.0-5.src.rpm
dlm-1.0.1-1.src.rpm
dlm-1.0.3-1.src.rpm
dlm-kernel-2.6.9-37.7.src.rpm
dlm-kernel-2.6.9-37.9.src.rpm
dlm-kernel-2.6.9-39.1.2.src.rpm
dlm-kernel-2.6.9-39.1.src.rpm
dlm-kernel-2.6.9-41.7.1.src.rpm
dlm-kernel-2.6.9-41.7.2.src.rpm
dlm-kernel-2.6.9-41.7.src.rpm
dlm-kernel-2.6.9-42.10.src.rpm
dlm-kernel-2.6.9-42.11.src.rpm
dlm-kernel-2.6.9-42.12.src.rpm
dlm-kernel-2.6.9-42.13.src.rpm
dlm-kernel-2.6.9-44.2.src.rpm
dlm-kernel-2.6.9-44.3.src.rpm
dlm-kernel-2.6.9-44.8.src.rpm
dlm-kernel-2.6.9-44.9.src.rpm
dlm-kernel-2.6.9-46.16.0.1.src.rpm
dlm-kernel-2.6.9-46.16.src.rpm
fence-1.32.10-0.src.rpm
fence-1.32.18-0.src.rpm
fence-1.32.25-1.src.rpm
fence-1.32.45-1.0.1.src.rpm
fence-1.32.45-1.src.rpm
fence-1.32.6-0.src.rpm
gulm-1.0.10-0.src.rpm
gulm-1.0.4-0.src.rpm
gulm-1.0.6-0.src.rpm
gulm-1.0.7-0.src.rpm
gulm-1.0.8-0.src.rpm
iddev-2.0.0-3.src.rpm
iddev-2.0.0-4.src.rpm
magma-1.0.1-4.src.rpm
magma-1.0.3-2.src.rpm
magma-1.0.4-0.src.rpm
magma-1.0.6-0.src.rpm
magma-1.0.7-1.src.rpm
magma-plugins-1.0.12-0.src.rpm
magma-plugins-1.0.2-0.src.rpm
magma-plugins-1.0.5-0.src.rpm
magma-plugins-1.0.6-0.src.rpm
magma-plugins-1.0.9-0.src.rpm
piranha-0.8.1-1.src.rpm
piranha-0.8.2-1.src.rpm
rgmanager-1.9.38-0.src.rpm
rgmanager-1.9.39-0.src.rpm
rgmanager-1.9.43-0.src.rpm
rgmanager-1.9.46-0.src.rpm
rgmanager-1.9.53-0.src.rpm
rgmanager-1.9.54-1.src.rpm
rgmanager-1.9.68-1.src.rpm
system-config-cluster-1.0.16-1.0.src.rpm
system-config-cluster-1.0.25-1.0.src.rpm
system-config-cluster-1.0.27-1.0.src.rpm
system-config-cluster-1.0.45-1.0.src.rpm
but I don't know which ones are suited for AS4U5, do you know?

-- 
He is nothing



From darkblue2000 at gmail.com  Mon Jul 30 03:37:42 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Mon, 30 Jul 2007 11:37:42 +0800
Subject: [Linux-cluster] Which packages are the right combination for AS4U5?
Message-ID: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>

Hello,
I am a newbie of cluster.I encounter a problem when installing cluster
suite on AS4U5, I download the following packages from
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/

ccs-1.0.10-0.src.rpm
ccs-1.0.2-0.src.rpm
ccs-1.0.3-0.src.rpm
ccs-1.0.7-0.src.rpm
clustermon-0.9.1-8.src.rpm
cman-1.0.11-0.src.rpm
cman-1.0.17-0.src.rpm
cman-1.0.2-0.src.rpm
cman-1.0.4-0.src.rpm
cman-kernel-2.6.9-39.5.src.rpm
cman-kernel-2.6.9-39.8.src.rpm
cman-kernel-2.6.9-41.0.2.src.rpm
cman-kernel-2.6.9-41.0.src.rpm
cman-kernel-2.6.9-43.8.3.src.rpm
cman-kernel-2.6.9-43.8.5.src.rpm
cman-kernel-2.6.9-43.8.src.rpm
cman-kernel-2.6.9-45.14.src.rpm
cman-kernel-2.6.9-45.15.src.rpm
cman-kernel-2.6.9-45.2.src.rpm
cman-kernel-2.6.9-45.3.src.rpm
cman-kernel-2.6.9-45.4.src.rpm
cman-kernel-2.6.9-45.5.src.rpm
cman-kernel-2.6.9-45.8.src.rpm
cman-kernel-2.6.9-50.2.0.1.src.rpm
cman-kernel-2.6.9-50.2.src.rpm
conga-0.9.1-8.src.rpm
dlm-1.0.0-5.src.rpm
dlm-1.0.1-1.src.rpm
dlm-1.0.3-1.src.rpm
dlm-kernel-2.6.9-37.7.src.rpm
dlm-kernel-2.6.9-37.9.src.rpm
dlm-kernel-2.6.9-39.1.2.src.rpm
dlm-kernel-2.6.9-39.1.src.rpm
dlm-kernel-2.6.9-41.7.1.src.rpm
dlm-kernel-2.6.9-41.7.2.src.rpm
dlm-kernel-2.6.9-41.7.src.rpm
dlm-kernel-2.6.9-42.10.src.rpm
dlm-kernel-2.6.9-42.11.src.rpm
dlm-kernel-2.6.9-42.12.src.rpm
dlm-kernel-2.6.9-42.13.src.rpm
dlm-kernel-2.6.9-44.2.src.rpm
dlm-kernel-2.6.9-44.3.src.rpm
dlm-kernel-2.6.9-44.8.src.rpm
dlm-kernel-2.6.9-44.9.src.rpm
dlm-kernel-2.6.9-46.16.0.1.src.rpm
dlm-kernel-2.6.9-46.16.src.rpm
fence-1.32.10-0.src.rpm
fence-1.32.18-0.src.rpm
fence-1.32.25-1.src.rpm
fence-1.32.45-1.0.1.src.rpm
fence-1.32.45-1.src.rpm
fence-1.32.6-0.src.rpm
gulm-1.0.10-0.src.rpm
gulm-1.0.4-0.src.rpm
gulm-1.0.6-0.src.rpm
gulm-1.0.7-0.src.rpm
gulm-1.0.8-0.src.rpm
iddev-2.0.0-3.src.rpm
iddev-2.0.0-4.src.rpm
magma-1.0.1-4.src.rpm
magma-1.0.3-2.src.rpm
magma-1.0.4-0.src.rpm
magma-1.0.6-0.src.rpm
magma-1.0.7-1.src.rpm
magma-plugins-1.0.12-0.src.rpm
magma-plugins-1.0.2-0.src.rpm
magma-plugins-1.0.5-0.src.rpm
magma-plugins-1.0.6-0.src.rpm
magma-plugins-1.0.9-0.src.rpm
piranha-0.8.1-1.src.rpm
piranha-0.8.2-1.src.rpm
rgmanager-1.9.38-0.src.rpm
rgmanager-1.9.39-0.src.rpm
rgmanager-1.9.43-0.src.rpm
rgmanager-1.9.46-0.src.rpm
rgmanager-1.9.53-0.src.rpm
rgmanager-1.9.54-1.src.rpm
rgmanager-1.9.68-1.src.rpm
system-config-cluster-1.0.16-1.0.src.rpm
system-config-cluster-1.0.25-1.0.src.rpm
system-config-cluster-1.0.27-1.0.src.rpm
system-config-cluster-1.0.45-1.0.src.rpm

but I want to know which packages are the right combination for AS4U5?

-- 
He is nothing



From sebastian.walter at fu-berlin.de  Mon Jul 30 10:00:58 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Mon, 30 Jul 2007 12:00:58 +0200
Subject: [Linux-cluster] Which packages are the right combination for
	AS4U5?
In-Reply-To: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>
References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>
Message-ID: <46ADB6DA.7000603@fu-berlin.de>

Hi,

maybe you want to orientate on the centos distribution. Easiest for you
would be to somehow get yum working and then import the whole
repository. If it's a new installation, I suggest you to switch to such
a rhel-compatible distribution as centos or scientific linux anyway.

http://mirror.centos.org/centos/4/csgfs/

Regards,
Sebastian


darkblue wrote:
> Hello,
> I am a newbie of cluster.I encounter a problem when installing cluster
> suite on AS4U5, I download the following packages from
> ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/
>
> ccs-1.0.10-0.src.rpm
> ccs-1.0.2-0.src.rpm
> ccs-1.0.3-0.src.rpm
> ccs-1.0.7-0.src.rpm
> clustermon-0.9.1-8.src.rpm
> cman-1.0.11-0.src.rpm
> cman-1.0.17-0.src.rpm
> cman-1.0.2-0.src.rpm
> cman-1.0.4-0.src.rpm
> cman-kernel-2.6.9-39.5.src.rpm
> cman-kernel-2.6.9-39.8.src.rpm
> cman-kernel-2.6.9-41.0.2.src.rpm
> cman-kernel-2.6.9-41.0.src.rpm
> cman-kernel-2.6.9-43.8.3.src.rpm
> cman-kernel-2.6.9-43.8.5.src.rpm
> cman-kernel-2.6.9-43.8.src.rpm
> cman-kernel-2.6.9-45.14.src.rpm
> cman-kernel-2.6.9-45.15.src.rpm
> cman-kernel-2.6.9-45.2.src.rpm
> cman-kernel-2.6.9-45.3.src.rpm
> cman-kernel-2.6.9-45.4.src.rpm
> cman-kernel-2.6.9-45.5.src.rpm
> cman-kernel-2.6.9-45.8.src.rpm
> cman-kernel-2.6.9-50.2.0.1.src.rpm
> cman-kernel-2.6.9-50.2.src.rpm
> conga-0.9.1-8.src.rpm
> dlm-1.0.0-5.src.rpm
> dlm-1.0.1-1.src.rpm
> dlm-1.0.3-1.src.rpm
> dlm-kernel-2.6.9-37.7.src.rpm
> dlm-kernel-2.6.9-37.9.src.rpm
> dlm-kernel-2.6.9-39.1.2.src.rpm
> dlm-kernel-2.6.9-39.1.src.rpm
> dlm-kernel-2.6.9-41.7.1.src.rpm
> dlm-kernel-2.6.9-41.7.2.src.rpm
> dlm-kernel-2.6.9-41.7.src.rpm
> dlm-kernel-2.6.9-42.10.src.rpm
> dlm-kernel-2.6.9-42.11.src.rpm
> dlm-kernel-2.6.9-42.12.src.rpm
> dlm-kernel-2.6.9-42.13.src.rpm
> dlm-kernel-2.6.9-44.2.src.rpm
> dlm-kernel-2.6.9-44.3.src.rpm
> dlm-kernel-2.6.9-44.8.src.rpm
> dlm-kernel-2.6.9-44.9.src.rpm
> dlm-kernel-2.6.9-46.16.0.1.src.rpm
> dlm-kernel-2.6.9-46.16.src.rpm
> fence-1.32.10-0.src.rpm
> fence-1.32.18-0.src.rpm
> fence-1.32.25-1.src.rpm
> fence-1.32.45-1.0.1.src.rpm
> fence-1.32.45-1.src.rpm
> fence-1.32.6-0.src.rpm
> gulm-1.0.10-0.src.rpm
> gulm-1.0.4-0.src.rpm
> gulm-1.0.6-0.src.rpm
> gulm-1.0.7-0.src.rpm
> gulm-1.0.8-0.src.rpm
> iddev-2.0.0-3.src.rpm
> iddev-2.0.0-4.src.rpm
> magma-1.0.1-4.src.rpm
> magma-1.0.3-2.src.rpm
> magma-1.0.4-0.src.rpm
> magma-1.0.6-0.src.rpm
> magma-1.0.7-1.src.rpm
> magma-plugins-1.0.12-0.src.rpm
> magma-plugins-1.0.2-0.src.rpm
> magma-plugins-1.0.5-0.src.rpm
> magma-plugins-1.0.6-0.src.rpm
> magma-plugins-1.0.9-0.src.rpm
> piranha-0.8.1-1.src.rpm
> piranha-0.8.2-1.src.rpm
> rgmanager-1.9.38-0.src.rpm
> rgmanager-1.9.39-0.src.rpm
> rgmanager-1.9.43-0.src.rpm
> rgmanager-1.9.46-0.src.rpm
> rgmanager-1.9.53-0.src.rpm
> rgmanager-1.9.54-1.src.rpm
> rgmanager-1.9.68-1.src.rpm
> system-config-cluster-1.0.16-1.0.src.rpm
> system-config-cluster-1.0.25-1.0.src.rpm
> system-config-cluster-1.0.27-1.0.src.rpm
> system-config-cluster-1.0.45-1.0.src.rpm
>
> but I want to know which packages are the right combination for AS4U5?
>
>   



From sebastian.walter at fu-berlin.de  Mon Jul 30 11:34:53 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Mon, 30 Jul 2007 13:34:53 +0200
Subject: [Linux-cluster] Which packages are the right combination for
	AS4U5?
In-Reply-To: <2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com>
References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>	
	<46ADB6DA.7000603@fu-berlin.de>
	<2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com>
Message-ID: <46ADCCDD.1010209@fu-berlin.de>

If you are using RHEL in an production environment, I can only recommend
you to use the original rhel packages, as the centos' ones are modified.
Anyway, the versions of the rpm's should be the same. So this is the
list of packages what is installed on my centos 4.5 system:

(rhcs: rgmanager system-config-cluster ccsd magma magma-plugins cman
cman-kernel-smp dlm dlm-kernel-smp fence gulm iddev)
Installing:
cman                    x86_64     1.0.17-0         csgfs              67 k
cman-kernel-smp         x86_64     2.6.9-50.2       csgfs             133 k
dlm                     x86_64     1.0.3-1          csgfs              13 k
dlm-kernel-smp          x86_64     2.6.9-46.16      csgfs             132 k
fence                   x86_64     1.32.45-1.0.1    csgfs             282 k
gulm                    x86_64     1.0.10-0         csgfs             151 k
iddev                   x86_64     2.0.0-4          csgfs             2.3 k
magma                   x86_64     1.0.7-1          csgfs              37 k
magma-plugins           x86_64     1.0.12-0         csgfs              19 k
rgmanager               x86_64     1.9.68-1         csgfs             209 k
system-config-cluster   noarch     1.0.45-1.0       csgfs             122 k
Installing for dependencies:
ccs                     x86_64     1.0.10-0         csgfs              80 k
perl-Net-Telnet         noarch     3.03-3           csgfs              51 k
seamonkey-nss           x86_64     1.0.9-2.el4.centos  update           
872 k

(gfs: GFS GFS-kernel-smp gnbd gnbd-kernel-smp lvm2-cluster
GFS-kernheaders gnbd-kernheaders)
GFS                     x86_64     6.1.14-0         csgfs             152 k
GFS-kernel-smp          x86_64     2.6.9-72.2       csgfs             214 k
GFS-kernheaders         x86_64     2.6.9-72.2       csgfs              20 k
gnbd                    x86_64     1.0.9-1          csgfs             142 k
gnbd-kernel-smp         x86_64     2.6.9-10.20      csgfs              13 k
gnbd-kernheaders        x86_64     2.6.9-10.20      csgfs             4.1 k
lvm2-cluster            x86_64     2.02.21-7.el4    csgfs             199 k

In this configuration, which comes from the yum rhcs repository, I had
to downgrade to kernel kernel-smp-2.6.9-55.EL. Maybe you also want to
install luci and ricci:

yum install luci
Installing:
 luci                    x86_64     0.9.1-8.el4.centos.1  csgfs

yum install ricci
Installing:
 ricci                   x86_64     0.9.1-8.el4.centos.1  csgfs 1.1 M
Installing for dependencies:
 modcluster              x86_64     0.9.1-8.el4.centos 
csgfs             317 k
 oddjob                  x86_64     0.26-1.1         base               57 k
 oddjob-libs             x86_64     0.26-1.1         base               43 k

That's it. Regards,
Sebastian

darkblue wrote:
> thanks very much, I have been waiting this letter for the whole day.
> May I using yum to install centos's packages on redhat as4u5, because
> the OS of the production server is redhat as4u5.
>
> 2007/7/30, Sebastian Walter wrote:
>   
>> Hi,
>>
>> maybe you want to orientate on the centos distribution. Easiest for you
>> would be to somehow get yum working and then import the whole
>> repository. If it's a new installation, I suggest you to switch to such
>> a rhel-compatible distribution as centos or scientific linux anyway.
>>
>> http://mirror.centos.org/centos/4/csgfs/
>>
>> Regards,
>> Sebastian
>>
>>
>> darkblue wrote:
>>     
>>> Hello,
>>> I am a newbie of cluster.I encounter a problem when installing cluster
>>> suite on AS4U5, I download the following packages from
>>> ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/
>>>
>>> ccs-1.0.10-0.src.rpm
>>> ccs-1.0.2-0.src.rpm
>>> ccs-1.0.3-0.src.rpm
>>> ccs-1.0.7-0.src.rpm
>>> clustermon-0.9.1-8.src.rpm
>>> cman-1.0.11-0.src.rpm
>>> cman-1.0.17-0.src.rpm
>>> cman-1.0.2-0.src.rpm
>>> cman-1.0.4-0.src.rpm
>>> cman-kernel-2.6.9-39.5.src.rpm
>>> cman-kernel-2.6.9-39.8.src.rpm
>>> cman-kernel-2.6.9-41.0.2.src.rpm
>>> cman-kernel-2.6.9-41.0.src.rpm
>>> cman-kernel-2.6.9-43.8.3.src.rpm
>>> cman-kernel-2.6.9-43.8.5.src.rpm
>>> cman-kernel-2.6.9-43.8.src.rpm
>>> cman-kernel-2.6.9-45.14.src.rpm
>>> cman-kernel-2.6.9-45.15.src.rpm
>>> cman-kernel-2.6.9-45.2.src.rpm
>>> cman-kernel-2.6.9-45.3.src.rpm
>>> cman-kernel-2.6.9-45.4.src.rpm
>>> cman-kernel-2.6.9-45.5.src.rpm
>>> cman-kernel-2.6.9-45.8.src.rpm
>>> cman-kernel-2.6.9-50.2.0.1.src.rpm
>>> cman-kernel-2.6.9-50.2.src.rpm
>>> conga-0.9.1-8.src.rpm
>>> dlm-1.0.0-5.src.rpm
>>> dlm-1.0.1-1.src.rpm
>>> dlm-1.0.3-1.src.rpm
>>> dlm-kernel-2.6.9-37.7.src.rpm
>>> dlm-kernel-2.6.9-37.9.src.rpm
>>> dlm-kernel-2.6.9-39.1.2.src.rpm
>>> dlm-kernel-2.6.9-39.1.src.rpm
>>> dlm-kernel-2.6.9-41.7.1.src.rpm
>>> dlm-kernel-2.6.9-41.7.2.src.rpm
>>> dlm-kernel-2.6.9-41.7.src.rpm
>>> dlm-kernel-2.6.9-42.10.src.rpm
>>> dlm-kernel-2.6.9-42.11.src.rpm
>>> dlm-kernel-2.6.9-42.12.src.rpm
>>> dlm-kernel-2.6.9-42.13.src.rpm
>>> dlm-kernel-2.6.9-44.2.src.rpm
>>> dlm-kernel-2.6.9-44.3.src.rpm
>>> dlm-kernel-2.6.9-44.8.src.rpm
>>> dlm-kernel-2.6.9-44.9.src.rpm
>>> dlm-kernel-2.6.9-46.16.0.1.src.rpm
>>> dlm-kernel-2.6.9-46.16.src.rpm
>>> fence-1.32.10-0.src.rpm
>>> fence-1.32.18-0.src.rpm
>>> fence-1.32.25-1.src.rpm
>>> fence-1.32.45-1.0.1.src.rpm
>>> fence-1.32.45-1.src.rpm
>>> fence-1.32.6-0.src.rpm
>>> gulm-1.0.10-0.src.rpm
>>> gulm-1.0.4-0.src.rpm
>>> gulm-1.0.6-0.src.rpm
>>> gulm-1.0.7-0.src.rpm
>>> gulm-1.0.8-0.src.rpm
>>> iddev-2.0.0-3.src.rpm
>>> iddev-2.0.0-4.src.rpm
>>> magma-1.0.1-4.src.rpm
>>> magma-1.0.3-2.src.rpm
>>> magma-1.0.4-0.src.rpm
>>> magma-1.0.6-0.src.rpm
>>> magma-1.0.7-1.src.rpm
>>> magma-plugins-1.0.12-0.src.rpm
>>> magma-plugins-1.0.2-0.src.rpm
>>> magma-plugins-1.0.5-0.src.rpm
>>> magma-plugins-1.0.6-0.src.rpm
>>> magma-plugins-1.0.9-0.src.rpm
>>> piranha-0.8.1-1.src.rpm
>>> piranha-0.8.2-1.src.rpm
>>> rgmanager-1.9.38-0.src.rpm
>>> rgmanager-1.9.39-0.src.rpm
>>> rgmanager-1.9.43-0.src.rpm
>>> rgmanager-1.9.46-0.src.rpm
>>> rgmanager-1.9.53-0.src.rpm
>>> rgmanager-1.9.54-1.src.rpm
>>> rgmanager-1.9.68-1.src.rpm
>>> system-config-cluster-1.0.16-1.0.src.rpm
>>> system-config-cluster-1.0.25-1.0.src.rpm
>>> system-config-cluster-1.0.27-1.0.src.rpm
>>> system-config-cluster-1.0.45-1.0.src.rpm
>>>
>>> but I want to know which packages are the right combination for AS4U5?
>>>
>>>
>>>       
>>     
>
>
>   



From breeves at redhat.com  Mon Jul 30 11:35:21 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Mon, 30 Jul 2007 12:35:21 +0100
Subject: [Linux-cluster] where can I download =?UTF-8?B?cmhjcy5pc2/vvJ8=?=
In-Reply-To: <2c8195ff0707290322i733590faq7dd2695b09c247dd@mail.gmail.com>
References: <2c8195ff0707290142j4912c95cwdeec398736b9042a@mail.gmail.com>	<1185699547.30014.9.camel@kuli.magnifix.com.my>	<2c8195ff0707290216x1a27fca0sdeeb139e776ed9a4@mail.gmail.com>	<1185701715.30014.12.camel@kuli.magnifix.com.my>
	<2c8195ff0707290322i733590faq7dd2695b09c247dd@mail.gmail.com>
Message-ID: <46ADCCF9.1080809@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

darkblue wrote:
> I have an 30-day unsupport evalution subscription, and already actived it.
> but still can't access rhcs channel. any body know how to do it?
> 

The 30-day evaluation RHN subscription only includes entitlements for
the base RHEL channels - you should contact a sales representative if
you need to evaluate the clustering components too and they will be able
to advise you.

Kind regards,
Bryn.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGrcz56YSQoMYUY94RAjIXAKDa8w/TxgeJfBGmW7iNBRnVcWvwdQCdGa28
Sj54MX7G9JVC2ox1FPhanx0=
=JA5D
-----END PGP SIGNATURE-----



From lhh at redhat.com  Mon Jul 30 13:56:20 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 09:56:20 -0400
Subject: [Linux-cluster] RGMANAGER segmentation fault
In-Reply-To: <727995.29330.qm@web52301.mail.re2.yahoo.com>
References: <727995.29330.qm@web52301.mail.re2.yahoo.com>
Message-ID: <20070730135620.GL9112@redhat.com>

On Sat, Jul 28, 2007 at 12:15:57AM -0700, chirantha pitigala wrote:
> Hi,
> 
> I checked the errata but was not able to find out any rgamanger related fixes. What is the bug id you are refering to?
> 

Apparently, the bug in question wasn't made public when it should have
been.  Here's a clone of the bug.  You may or may not be able to access
the linked patch directly.  However, it's also linked off of my people
page, in CVS, and in the obsolete, totally unsupported test RPMs on my
people page.  (Obviously, they're in 4.5 too ;) )

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250085
http://people.redhat.com/lhh/patches.html

The minimum set for getting *rgmanager* updated to the 4.5 version is
the following (I think):

  ccs
  magma-plugins
  magma
  rgmanager

If you're hitting problems with rgmanager and you do are not ready to
update the whole system to 4.5, you should be able to update those
four packages.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 30 14:02:07 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 10:02:07 -0400
Subject: [Linux-cluster] RGMANAGER segmentation fault
In-Reply-To: <20070730135620.GL9112@redhat.com>
References: <727995.29330.qm@web52301.mail.re2.yahoo.com>
	<20070730135620.GL9112@redhat.com>
Message-ID: <20070730140207.GA4955@redhat.com>

1  magma-plugins
2  magma
3  ccs
4  rgmanager

Note: update them in the above order, and restart ccsd before restarting
rgmanager, or rgmanager won't see your services.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 30 14:15:25 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 10:15:25 -0400
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <200707272034.40310.hlawatschek@atix.de>
References: <200707241621.45892.hlawatschek@atix.de>
	<200707252050.17898.hlawatschek@atix.de>
	<20070727134519.GD9112@redhat.com>
	<200707272034.40310.hlawatschek@atix.de>
Message-ID: <20070730141525.GB4955@redhat.com>

On Fri, Jul 27, 2007 at 08:34:40PM +0200, Mark Hlawatschek wrote:
> On Friday 27 July 2007 15:45:19 Lon Hohberger wrote:
> > I think there's already a feature request open about it, by you?
> Yes, there are two feature requests open that address service dependencies
> 247980 - RFE: (strong and weak) service dependencies
> 247772 - RFE: One service following another
> >
> > Implementation-wise, in RHEL5, there's an event queue for service state
> > changes, and a function which broadcasts starts/stops
> > (broadcast_event()).  The difference is that in RHEL5, you can multicast
> > to all rgmanagers w/ one call - whereas on RHEL4, you have to open a
> > connection to each and send the message individually.
> >
> > So, basically, it involves porting the event queue code and implementing
> > broadcast_event().
> OK, how much effort - in your opinion - would it be to port/implement these 
> features ?

'follow' is harder to implement cleanly than strong vs. weak deps;
mostly because it's a two-step process.  On RHEL5, it's easier due to
the event mechanism (you can trigger off 'service started' statechange
and move the conflicting service to another node).

The easiest path would be to port the event mechanism to RHEL4.
So, it's more likely to appear in RHEL5 first (less work).

The current train of thought is something like this:

  * A follows B

  * When A transitions to started, move B to another
    node if they are on the same node.

> Is there a task scheduled to do these things ? 

I can find out.

I assume you're willing to assist in testing, is that right?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 30 14:16:49 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 10:16:49 -0400
Subject: [Linux-cluster] Qdisk configuration
In-Reply-To: <OFE8B0F0D8.6B4D59D2-ON85257325.0064DA64-85257325.0065BCBF@jmsmucker.com>
References: <46AA34EE.4000509@transolutions.net>
	<OFE8B0F0D8.6B4D59D2-ON85257325.0064DA64-85257325.0065BCBF@jmsmucker.com>
Message-ID: <20070730141649.GC4955@redhat.com>

On Fri, Jul 27, 2007 at 02:26:05PM -0400, Dan.Askew at jmsmucker.com wrote:
> Greetings all,
> 
> I have a small question on using a quorum disk. Below is me config for the 
> quorum disk
> 
>               <quorumd device="/dev/sdc1" interval="1" label="qdisk" 
> min_score="3" tko="10" votes="3">
>                         <heuristic interval="1" program="/bin/ping 
> 10.1.0.18 -c1 -t1" score="1"/>
>               </quorumd>
> 
> 
> The device is pressent and the label is correct
> 
> root at goober1:/root> e2label /dev/sdc1
> qdisk

Erh, e2label can't read qdisk partitions ;)

try

mkqdisk -L

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 30 14:17:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 10:17:51 -0400
Subject: [Linux-cluster] IP not failing over
In-Reply-To: <46AA5702.6010002@transolutions.net>
References: <46A8F63D.8000603@transolutions.net>
	<20070727170942.GE9112@redhat.com>
	<46AA34EE.4000509@transolutions.net>
	<46AA5702.6010002@transolutions.net>
Message-ID: <20070730141751.GD4955@redhat.com>

On Fri, Jul 27, 2007 at 03:35:14PM -0500, James Wilson wrote:
> Would it be easier to create a script to bing up the virtual ip and have 
> rgmanager start the script on failover?
> 
> James Wilson wrote:
> >I removed the second entry for the IP resource but the IP still 
> >doesn't failover to the second node. Here is the output from rg_test

When you enable/disable the service what do the logs say?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Jul 30 14:21:42 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 30 Jul 2007 10:21:42 -0400
Subject: [Linux-cluster] IP not failing over
In-Reply-To: <20070730141751.GD4955@redhat.com>
References: <46A8F63D.8000603@transolutions.net>
	<20070727170942.GE9112@redhat.com>
	<46AA34EE.4000509@transolutions.net>
	<46AA5702.6010002@transolutions.net>
	<20070730141751.GD4955@redhat.com>
Message-ID: <20070730142142.GE4955@redhat.com>

On Mon, Jul 30, 2007 at 10:17:51AM -0400, Lon Hohberger wrote:
> On Fri, Jul 27, 2007 at 03:35:14PM -0500, James Wilson wrote:
> > Would it be easier to create a script to bing up the virtual ip and have 
> > rgmanager start the script on failover?
> > 
> > James Wilson wrote:
> > >I removed the second entry for the IP resource but the IP still 
> > >doesn't failover to the second node. Here is the output from rg_test
> 
> When you enable/disable the service what do the logs say?
> 

Heh, dumb question, are you running 'fence_ack_manual' on the surviving
node?

(That's required when using manual fencing...)

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From hlawatschek at atix.de  Mon Jul 30 14:25:29 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Mon, 30 Jul 2007 16:25:29 +0200
Subject: [Linux-cluster] rgmanager "depend" tag on RHEL4
In-Reply-To: <20070730141525.GB4955@redhat.com>
References: <200707241621.45892.hlawatschek@atix.de>
	<200707272034.40310.hlawatschek@atix.de>
	<20070730141525.GB4955@redhat.com>
Message-ID: <200707301625.30182.hlawatschek@atix.de>

On Monday 30 July 2007 16:15:25 Lon Hohberger wrote:
> On Fri, Jul 27, 2007 at 08:34:40PM +0200, Mark Hlawatschek wrote:
> > On Friday 27 July 2007 15:45:19 Lon Hohberger wrote:
> > > I think there's already a feature request open about it, by you?
> >
> > Yes, there are two feature requests open that address service
> > dependencies 247980 - RFE: (strong and weak) service dependencies
> > 247772 - RFE: One service following another
> >
> > > Implementation-wise, in RHEL5, there's an event queue for service state
> > > changes, and a function which broadcasts starts/stops
> > > (broadcast_event()).  The difference is that in RHEL5, you can
> > > multicast to all rgmanagers w/ one call - whereas on RHEL4, you have to
> > > open a connection to each and send the message individually.
> > >
> > > So, basically, it involves porting the event queue code and
> > > implementing broadcast_event().
> >
> > OK, how much effort - in your opinion - would it be to port/implement
> > these features ?
>
> 'follow' is harder to implement cleanly than strong vs. weak deps;
> mostly because it's a two-step process.  On RHEL5, it's easier due to
> the event mechanism (you can trigger off 'service started' statechange
> and move the conflicting service to another node).
>
> The easiest path would be to port the event mechanism to RHEL4.
> So, it's more likely to appear in RHEL5 first (less work).
>
> The current train of thought is something like this:
>
>   * A follows B
>
>   * When A transitions to started, move B to another
>     node if they are on the same node.
>
> > Is there a task scheduled to do these things ?
>
> I can find out.
>
> I assume you're willing to assist in testing, is that right?

I certainly will do a lot of testing :-)

Thanks,

Mark


-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany




From jwilson at transolutions.net  Mon Jul 30 14:40:46 2007
From: jwilson at transolutions.net (James Wilson)
Date: Mon, 30 Jul 2007 09:40:46 -0500
Subject: [Linux-cluster] IP failover
Message-ID: <46ADF86E.6080002@transolutions.net>

Is it possibel with rhcs5 to failover an IP without specifying the 
shared storage. I have 2 servers being replicated by drbd 
primary/primary and when one fails I need the ip address to switch to 
the other server.



From bernard.chew at muvee.com  Mon Jul 30 14:50:02 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Mon, 30 Jul 2007 22:50:02 +0800
Subject: [Linux-cluster] create GFS file system on imported iSCSI disk
Message-ID: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net>

Hi,

I read through the Red Hat Docs, and could not find much information on
iSCSI configurations. I have a iSCSI target server and 4 other servers
running iSCSI initiators. 

May I know if we need to set up logical volumes on the imported iSCSI
disk before creating GFS or we can immediately run "gfs_mkfs -p lock_dlm
-t alpha_cluster:gfs01 -j 8 /dev/sdb" (where /dev/sdb refers to the
imported iSCSI disk) on one node and all nodes to "mount -t gfs /dev/sdb
/test"?

Thanks,
Bernard Chew
IT Operations



From breeves at redhat.com  Mon Jul 30 14:55:15 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Mon, 30 Jul 2007 15:55:15 +0100
Subject: [Linux-cluster] create GFS file system on imported iSCSI disk
In-Reply-To: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net>
References: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net>
Message-ID: <46ADFBD3.7060200@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bernard Chew wrote:
> May I know if we need to set up logical volumes on the imported iSCSI
> disk before creating GFS or we can immediately run "gfs_mkfs -p lock_dlm
> -t alpha_cluster:gfs01 -j 8 /dev/sdb" (where /dev/sdb refers to the
> imported iSCSI disk) on one node and all nodes to "mount -t gfs /dev/sdb
> /test"?
> 

There aren't any special steps needed for iSCSI over any other kind of
shared storage (once you've configured the initiators & the devices are
visible to the OS).

Creating volume groups and using logical volumes for GFS is a good idea
if you are likely to want to resize your devices at a later time but is
not strictly necessary.

Other than that, the steps you detailed should work fine.

Kind regards,
Bryn.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGrfvT6YSQoMYUY94RAgGFAJ96WnspeUgKHKiBwHRh71aluGcoUgCfXvrB
fg2wFdqf96s6kciF0ypfzB0=
=cTyA
-----END PGP SIGNATURE-----



From dist-list at LEXUM.UMontreal.CA  Mon Jul 30 15:44:34 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Mon, 30 Jul 2007 11:44:34 -0400
Subject: [Linux-cluster] Redhat 4.5 and lvm2-monitor 
Message-ID: <46AE0762.6020300@lexum.umontreal.ca>

Hello,
Since the 4.5 update, I have losts of error during boot.
my lvm.conf looks like :
locking_type = 2
library_dir = "/usr/lib"
locking_library = "liblvm2clusterlock.so"

The 4.5 update create a lvm.conf.rpmnew with these options :
locking_type = 1
(...)

Do we still need the loocking_type=2 or can we use the default lvm.conf ?

tx



From pbruna at it-linux.cl  Mon Jul 30 18:05:12 2007
From: pbruna at it-linux.cl (Patricio Bruna V.)
Date: Mon, 30 Jul 2007 14:05:12 -0400
Subject: [Linux-cluster] IP failover
In-Reply-To: <46ADF86E.6080002@transolutions.net>
References: <46ADF86E.6080002@transolutions.net>
Message-ID: <200707301405.12814.pbruna@it-linux.cl>

El Monday 30 July 2007 10:40:46 James Wilson escribi?:
> Is it possibel with rhcs5 to failover an IP without specifying the
> shared storage. I have 2 servers being replicated by drbd
> primary/primary and when one fails I need the ip address to switch to
> the other server.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, The IP would be the clustered service.

-- 
Patricio Bruna V.
IT Linux Ltda.
http://www.it-linux.cl
Fono : (+56-2) 333 0578
M?vil : (+56-09) 8827 0342
Gizmo: 1-747-100-4794 (itlinux)



From Robert.Gil at americanhm.com  Mon Jul 30 18:18:03 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Mon, 30 Jul 2007 14:18:03 -0400
Subject: [Linux-cluster] Redhat 4.5 and lvm2-monitor
In-Reply-To: <46AE0762.6020300@lexum.umontreal.ca>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA9A7@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

Yes, you still need those settings. If you reboot, you wont be able to
mount your volumes.


Rob Gil
Linux Systems Administrator

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of FM
Sent: Monday, July 30, 2007 11:45 AM
To: Redhat Cluster
Subject: [Linux-cluster] Redhat 4.5 and lvm2-monitor

Hello,
Since the 4.5 update, I have losts of error during boot.
my lvm.conf looks like :
locking_type = 2
library_dir = "/usr/lib"
locking_library = "liblvm2clusterlock.so"

The 4.5 update create a lvm.conf.rpmnew with these options :
locking_type = 1
(...)

Do we still need the loocking_type=2 or can we use the default lvm.conf
?

tx

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jbrassow at redhat.com  Mon Jul 30 18:16:26 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Mon, 30 Jul 2007 13:16:26 -0500
Subject: [Linux-cluster] lvremove error
In-Reply-To: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
References: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
Message-ID: <87363102-11BD-4A9D-9703-95FF0BED0A9E@redhat.com>

lvremove doesn't seem like it wants to remove the volume because it  
is in use on another node.  Is this true?  If so, the GUI shouldn't  
be removing it either.

  brassow

On Jul 28, 2007, at 3:04 PM, Mahmoud Hanafi wrote:

>
>
> I am trying to remove a lv via cmd line but I get the following error.
>
> [root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512
> Do you really want to remove active logical volume "gfs_st2_sz512"?  
> [y/n]: y
>   Error locking on node nfs1.local: Volume is busy on another node
>   Can't get exclusive access to volume "gfs_st2_sz512"
>
> But I can remove lv via the GUI (luci).
>
> What am I doing wrong on the cmd line.
>
>
>
> Thanks,
> Mahmoud
>
> ---------------------------------------------------------------------- 
> ---------------------------------------------------------------------- 
> ------------------------------------
> This is a PRIVATE message. If you are not the intended recipient,  
> please delete without copying and kindly advise us by e-mail of the  
> mistake in delivery. NOTE: Regardless of content, this e-mail shall  
> not operate to bind CSC to any order or other contract unless  
> pursuant to explicit written agreement or government initiative  
> expressly permitting the use of e-mail for such purpose.
> ---------------------------------------------------------------------- 
> ---------------------------------------------------------------------- 
> ------------------------------------
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070730/d9eb2241/attachment.htm>

From jbrassow at redhat.com  Mon Jul 30 18:17:37 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Mon, 30 Jul 2007 13:17:37 -0500
Subject: [Linux-cluster] Redhat 4.5 and lvm2-monitor 
In-Reply-To: <46AE0762.6020300@lexum.umontreal.ca>
References: <46AE0762.6020300@lexum.umontreal.ca>
Message-ID: <BC20006E-8F8A-44A3-9740-EA4391B6D7E8@redhat.com>

If you are using cluster LVM, try setting this 'locking_type = 3'

  brassow

On Jul 30, 2007, at 10:44 AM, FM wrote:

> Hello,
> Since the 4.5 update, I have losts of error during boot.
> my lvm.conf looks like :
> locking_type = 2
> library_dir = "/usr/lib"
> locking_library = "liblvm2clusterlock.so"
>
> The 4.5 update create a lvm.conf.rpmnew with these options :
> locking_type = 1
> (...)
>
> Do we still need the loocking_type=2 or can we use the default  
> lvm.conf ?
>
> tx
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Mon Jul 30 18:48:10 2007
From: jparsons at redhat.com (jim parsons)
Date: Mon, 30 Jul 2007 14:48:10 -0400
Subject: [Linux-cluster] lvremove error
In-Reply-To: <87363102-11BD-4A9D-9703-95FF0BED0A9E@redhat.com>
References: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
	<87363102-11BD-4A9D-9703-95FF0BED0A9E@redhat.com>
Message-ID: <1185821290.3225.5.camel@localhost.localdomain>

Thanks for this notice, Jonathan; I will check if this is in error...

-j

On Mon, 2007-07-30 at 13:16 -0500, Jonathan Brassow wrote:
> lvremove doesn't seem like it wants to remove the volume because it is
> in use on another node.  Is this true?  If so, the GUI shouldn't be
> removing it either.
> 
> 
>  brassow
> 
> On Jul 28, 2007, at 3:04 PM, Mahmoud Hanafi wrote:
> 
> > 
> > 
> > I am trying to remove a lv via cmd line but I get the following
> > error. 
> > 
> > [root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512 
> > Do you really want to remove active logical volume "gfs_st2_sz512"?
> > [y/n]: y 
> >   Error locking on node nfs1.local: Volume is busy on another node 
> >   Can't get exclusive access to volume "gfs_st2_sz512"
> > 
> > But I can remove lv via the GUI (luci).  
> > 
> > What am I doing wrong on the cmd line. 
> > 
> > 
> > 
> > Thanks, 
> > Mahmoud
> > 
> > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > This is a PRIVATE message. If you are not the intended recipient,
> > please delete without copying and kindly advise us by e-mail of the
> > mistake in delivery. NOTE: Regardless of content, this e-mail shall
> > not operate to bind CSC to any order or other contract unless
> > pursuant to explicit written agreement or government initiative
> > expressly permitting the use of e-mail for such purpose.
> > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rmccabe at redhat.com  Mon Jul 30 18:56:43 2007
From: rmccabe at redhat.com (Ryan McCabe)
Date: Mon, 30 Jul 2007 14:56:43 -0400
Subject: [Linux-cluster] lvremove error
In-Reply-To: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
References: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
Message-ID: <20070730185643.GA18581@redhat.com>

On Sat, Jul 28, 2007 at 04:04:03PM -0400, Mahmoud Hanafi wrote:
> I am trying to remove a lv via cmd line but I get the following error.
> 
> [root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512
> Do you really want to remove active logical volume "gfs_st2_sz512"? [y/n]: 
> y
>   Error locking on node nfs1.local: Volume is busy on another node
>   Can't get exclusive access to volume "gfs_st2_sz512"
> 
> But I can remove lv via the GUI (luci). 

Conga (luci) will execute an lvchange -an /dev/gfs_vg/gfs_st2_sz512
before executing lvremove. That command will (should) fail if the volume
is in use by other nodes. Can you try to execute that command first and
see what the results are?


Ryan



From jparsons at redhat.com  Mon Jul 30 18:58:17 2007
From: jparsons at redhat.com (jim parsons)
Date: Mon, 30 Jul 2007 14:58:17 -0400
Subject: [Linux-cluster] lvremove error
In-Reply-To: <1185821290.3225.5.camel@localhost.localdomain>
References: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
	<87363102-11BD-4A9D-9703-95FF0BED0A9E@redhat.com>
	<1185821290.3225.5.camel@localhost.localdomain>
Message-ID: <1185821897.3225.8.camel@localhost.localdomain>


Follow-up:

 luci just calls the same command line command on the ricci agent.
ricci does  lvchange -an which will error out if the volume is in use by
other cluster members
We just confirmed this behavior on a test cluster.

-J

On Mon, 2007-07-30 at 14:48 -0400, jim parsons wrote:
> Thanks for this notice, Jonathan; I will check if this is in error...
> 
> -j
> 
> On Mon, 2007-07-30 at 13:16 -0500, Jonathan Brassow wrote:
> > lvremove doesn't seem like it wants to remove the volume because it is
> > in use on another node.  Is this true?  If so, the GUI shouldn't be
> > removing it either.
> > 
> > 
> >  brassow
> > 
> > On Jul 28, 2007, at 3:04 PM, Mahmoud Hanafi wrote:
> > 
> > > 
> > > 
> > > I am trying to remove a lv via cmd line but I get the following
> > > error. 
> > > 
> > > [root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512 
> > > Do you really want to remove active logical volume "gfs_st2_sz512"?
> > > [y/n]: y 
> > >   Error locking on node nfs1.local: Volume is busy on another node 
> > >   Can't get exclusive access to volume "gfs_st2_sz512"
> > > 
> > > But I can remove lv via the GUI (luci).  
> > > 
> > > What am I doing wrong on the cmd line. 
> > > 
> > > 
> > > 
> > > Thanks, 
> > > Mahmoud
> > > 
> > > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > This is a PRIVATE message. If you are not the intended recipient,
> > > please delete without copying and kindly advise us by e-mail of the
> > > mistake in delivery. NOTE: Regardless of content, this e-mail shall
> > > not operate to bind CSC to any order or other contract unless
> > > pursuant to explicit written agreement or government initiative
> > > expressly permitting the use of e-mail for such purpose.
> > > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From dist-list at LEXUM.UMontreal.CA  Mon Jul 30 19:15:16 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Mon, 30 Jul 2007 15:15:16 -0400
Subject: [Linux-cluster] Redhat 4.5 and lvm2-monitor
In-Reply-To: <BC20006E-8F8A-44A3-9740-EA4391B6D7E8@redhat.com>
References: <46AE0762.6020300@lexum.umontreal.ca>
	<BC20006E-8F8A-44A3-9740-EA4391B6D7E8@redhat.com>
Message-ID: <46AE38C4.1090201@lexum.umontreal.ca>

tx for the reply,
I looked at the bug :
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=239779

and this email
http://www.redhat.com/archives/linux-cluster/2007-May/msg00149.html

I do not use mirrored LVM so I disabled the service  lvm2-monitore at
startup

Tx again!

------------------------------------
Fr?d?ric M?dery
Administrateur Syst?me /
System Administrator
LexUM, Universit? de Montr?al
email : mederyf at lexum.umontreal.ca
tel.  : (514) 343-6111  p. 1-3288
fax. : (514) 343-7359
------------------------------------



Jonathan Brassow wrote:
> If you are using cluster LVM, try setting this 'locking_type = 3'
>
>  brassow
>
> On Jul 30, 2007, at 10:44 AM, FM wrote:
>
>> Hello,
>> Since the 4.5 update, I have losts of error during boot.
>> my lvm.conf looks like :
>> locking_type = 2
>> library_dir = "/usr/lib"
>> locking_library = "liblvm2clusterlock.so"
>>
>> The 4.5 update create a lvm.conf.rpmnew with these options :
>> locking_type = 1
>> (...)
>>
>> Do we still need the loocking_type=2 or can we use the default
>> lvm.conf ?
>>
>> tx
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From chris at cmiware.com  Mon Jul 30 23:01:23 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 30 Jul 2007 18:01:23 -0500
Subject: [Linux-cluster] fence_apc failure
Message-ID: <46AE6DC3.1060807@cmiware.com>

We now have some APC 7931 units at our disposal, however the fence_apc 
perl script fails with "unrecognized menu response."  This appears to be 
from new firmware on the APC units.  There is a python script in CVS 
that looks like it may operate correctly with the new firmware menus, is 
this correct?  Also, is it a drop-in replacement for the perl script, 
i.e. will saving it as /sbin/fence_apc work with RHCS 5?

Thanks,
Chris



From eschneid at uccs.edu  Mon Jul 30 23:01:58 2007
From: eschneid at uccs.edu (Eric Schneider)
Date: Mon, 30 Jul 2007 17:01:58 -0600
Subject: [Linux-cluster] fence_apc failure
In-Reply-To: <46AE6DC3.1060807@cmiware.com>
References: <46AE6DC3.1060807@cmiware.com>
Message-ID: <002601c7d2fd$a1f75bd0$1b03c680@uccs.edu>

I have the same devices and I can use the default fence_apc as long as I
don't change the default names on the ports.  ON RHEL 5 BTW.

Eric

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chris Harms
Sent: Monday, July 30, 2007 5:01 PM
To: linux clustering
Subject: [Linux-cluster] fence_apc failure

We now have some APC 7931 units at our disposal, however the fence_apc 
perl script fails with "unrecognized menu response."  This appears to be 
from new firmware on the APC units.  There is a python script in CVS 
that looks like it may operate correctly with the new firmware menus, is 
this correct?  Also, is it a drop-in replacement for the perl script, 
i.e. will saving it as /sbin/fence_apc work with RHCS 5?

Thanks,
Chris

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From chris at cmiware.com  Mon Jul 30 23:12:24 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 30 Jul 2007 18:12:24 -0500
Subject: [Linux-cluster] fence_apc failure
In-Reply-To: <002601c7d2fd$a1f75bd0$1b03c680@uccs.edu>
References: <46AE6DC3.1060807@cmiware.com>
	<002601c7d2fd$a1f75bd0$1b03c680@uccs.edu>
Message-ID: <46AE7058.4000601@cmiware.com>

Hi Eric,

What version of the APC firmware are you running?

Thanks,
Chris

Eric Schneider wrote:
> I have the same devices and I can use the default fence_apc as long as I
> don't change the default names on the ports.  ON RHEL 5 BTW.
>
> Eric
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chris Harms
> Sent: Monday, July 30, 2007 5:01 PM
> To: linux clustering
> Subject: [Linux-cluster] fence_apc failure
>
> We now have some APC 7931 units at our disposal, however the fence_apc 
> perl script fails with "unrecognized menu response."  This appears to be 
> from new firmware on the APC units.  There is a python script in CVS 
> that looks like it may operate correctly with the new firmware menus, is 
> this correct?  Also, is it a drop-in replacement for the perl script, 
> i.e. will saving it as /sbin/fence_apc work with RHCS 5?
>
> Thanks,
> Chris
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From chris at cmiware.com  Mon Jul 30 23:25:55 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 30 Jul 2007 18:25:55 -0500
Subject: [Linux-cluster] fence_apc failure
In-Reply-To: <46AE6DC3.1060807@cmiware.com>
References: <46AE6DC3.1060807@cmiware.com>
Message-ID: <46AE7383.6090706@cmiware.com>

This appears to be a "no." 

 fence_node[26827]: agent "fence_apc" reports: Power Off 
unsuccessfulStatus check successful. Port 4 is ON

However the node was powered off so something worked.  Also, it appears 
to have sent an Off command instead of Reboot.


Chris Harms wrote:
> We now have some APC 7931 units at our disposal, however the fence_apc 
> perl script fails with "unrecognized menu response."  This appears to 
> be from new firmware on the APC units.  There is a python script in 
> CVS that looks like it may operate correctly with the new firmware 
> menus, is this correct?  Also, is it a drop-in replacement for the 
> perl script, i.e. will saving it as /sbin/fence_apc work with RHCS 5?
>
> Thanks,
> Chris
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From chris at cmiware.com  Tue Jul 31 00:09:32 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 30 Jul 2007 19:09:32 -0500
Subject: [Linux-cluster] fence_apc python script log
Message-ID: <46AE7DBC.2040804@cmiware.com>

Something appears amiss from the log that was generated.  I am attaching 
it for the developers.

Chris
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: apclog.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070730/f0503d7b/attachment.txt>

From bernard.chew at muvee.com  Tue Jul 31 02:01:54 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Tue, 31 Jul 2007 10:01:54 +0800
Subject: [Linux-cluster] create GFS file system on imported iSCSI disk
In-Reply-To: <46ADFBD3.7060200@redhat.com>
References: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net>
	<46ADFBD3.7060200@redhat.com>
Message-ID: <229C73600EB0E54DA818AB599482BCE901921895@shadowfax.sg.muvee.net>

Thanks Bryn!

- Bernard

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bryn M. Reeves
Sent: Monday, July 30, 2007 10:55 PM
To: linux clustering
Subject: Re: [Linux-cluster] create GFS file system on imported iSCSI
disk

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bernard Chew wrote:
> May I know if we need to set up logical volumes on the imported iSCSI
> disk before creating GFS or we can immediately run "gfs_mkfs -p
lock_dlm
> -t alpha_cluster:gfs01 -j 8 /dev/sdb" (where /dev/sdb refers to the
> imported iSCSI disk) on one node and all nodes to "mount -t gfs
/dev/sdb
> /test"?
> 

There aren't any special steps needed for iSCSI over any other kind of
shared storage (once you've configured the initiators & the devices are
visible to the OS).

Creating volume groups and using logical volumes for GFS is a good idea
if you are likely to want to resize your devices at a later time but is
not strictly necessary.

Other than that, the steps you detailed should work fine.

Kind regards,
Bryn.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGrfvT6YSQoMYUY94RAgGFAJ96WnspeUgKHKiBwHRh71aluGcoUgCfXvrB
fg2wFdqf96s6kciF0ypfzB0=
=cTyA
-----END PGP SIGNATURE-----

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From kristoffer.lippert at jppol.dk  Tue Jul 31 08:32:06 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Tue, 31 Jul 2007 10:32:06 +0200
Subject: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <OFC8170BCF.A1271744-ON8525731E.004CDBE8-8525731E.004F1A36@jmsmucker.com>
References: <506B469CC6211B49BE28F6AC56BBCA35013BF882@STJH-P102.PSNL.CA>
	<OFC8170BCF.A1271744-ON8525731E.004CDBE8-8525731E.004F1A36@jmsmucker.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA57B261A@exchsrv07.rootdom.dk>

Hi Dan,
 
Have you got any performance figures on the OCFS2 ? 
 
My GFS is currently delivering the following performance:
(2 nodes, 2gbit FC HBAs multipath'ed + 1 Fujitsu SAN (SX60) 7x250gb raid 5.)
 
Writes: average 100mb/sec, 200 files/sec (with simple cp/rm)
Writes: max 150mb/sec, 250files/sec (with simple cp/rm)
Reads: average 160mb/sec (cat > dev/null - uncached)
Reads: average 1150mb/sec (cat > dev/null - cached)
Anybody have a feeling about this performance level?
 
Regards
Kristoffer
 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 16:19
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe



Ok Here is what I did to get it working: 

used RHCS to manage the cluster services 
used CLVM  to manage the LVM metadata across cluster 
        Coded a "locking_type = 3" in the LVM.CONF file 

Used OCFS2 to cluster and manage the filesystems across cluster. 

Configuration is 

2 node cluster runing nfs 

vitural IPAddr service to allow clients to mount the drives via NFS 

All appears to be working fine. I can fail over the services manually or even boot a server and the failover works great. 

Now the big question....is this supported by redhat and oracle ...hmmmm 


Thanks to all who helped get this working... 



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/f088c30c/attachment.htm>

From Dan.Askew at jmsmucker.com  Tue Jul 31 10:12:11 2007
From: Dan.Askew at jmsmucker.com (Dan.Askew at jmsmucker.com)
Date: Tue, 31 Jul 2007 06:12:11 -0400
Subject: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA57B261A@exchsrv07.rootdom.dk>
Message-ID: <OF65296377.33FDE5A2-ON85257329.0037E91C-85257329.0038873F@jmsmucker.com>

No I do not but if you elaborate on the tests performed or have atest 
script I would be glad run some bench marks.

It would also be helpful to know the hardware you are running on.


Dan Askew
Sr Systems Administrator
The J. M. Smucker Company

Phone (330)684-3662





"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com
07/31/2007 04:32 AM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com>
cc

Subject
SV: [Linux-cluster] Linux Clustering Newbe






Hi Dan,
 
Have you got any performance figures on the OCFS2 ? 
 
My GFS is currently delivering the following performance:
(2 nodes, 2gbit FC HBAs multipath'ed + 1 Fujitsu SAN (SX60) 7x250gb raid 
5.)
 
Writes: average 100mb/sec, 200 files/sec (with simple cp/rm)
Writes: max 150mb/sec, 250files/sec (with simple cp/rm)
Reads: average 160mb/sec (cat > dev/null - uncached)
Reads: average 1150mb/sec (cat > dev/null - cached)
Anybody have a feeling about this performance level?
 
Regards
Kristoffer
 

Fra: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] P? vegne af 
Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 16:19
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Ok Here is what I did to get it working: 

used RHCS to manage the cluster services 
used CLVM  to manage the LVM metadata across cluster 
        Coded a "locking_type = 3" in the LVM.CONF file 

Used OCFS2 to cluster and manage the filesystems across cluster. 

Configuration is 

2 node cluster runing nfs 

vitural IPAddr service to allow clients to mount the drives via NFS 

All appears to be working fine. I can fail over the services manually or 
even boot a server and the failover works great. 

Now the big question....is this supported by redhat and oracle ...hmmmm 


Thanks to all who helped get this working... 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/04235531/attachment.htm>

From kristoffer.lippert at jppol.dk  Tue Jul 31 10:47:25 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Tue, 31 Jul 2007 12:47:25 +0200
Subject: SV: SV: [Linux-cluster] Linux Clustering Newbe
In-Reply-To: <OF65296377.33FDE5A2-ON85257329.0037E91C-85257329.0038873F@jmsmucker.com>
References: <00B9BFA1C44A674794C9A1A4F5A22CA57B261A@exchsrv07.rootdom.dk>
	<OF65296377.33FDE5A2-ON85257329.0037E91C-85257329.0038873F@jmsmucker.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA57B2624@exchsrv07.rootdom.dk>

Hi,
 
 
I used sysbasher and the web interface on the san. I figure that it gives an ok view of the load on the san. It corresponds ok with timed cp's.
Also i did some loops that just copied to and from the san. But generally it's just a rough average, and i'm just wondering if it's similar in performance to other configurations.
 
The hardware is:
 2 fujitsusiemens RX200s3 servers with emulex 11003 hbas and two dualcore cpus and 4gb memory.
Connected is a SX60 san with an extra controller and powersupply. 
 

Mvh / Kind regards 

Kristoffer Lippert 
Systemansvarlig 
JP/Politiken A/S 
Online Magasiner 

Tlf. +45 8738 3032 
Cell. +45 6062 8703 

 

________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 31. juli 2007 12:12
Til: linux clustering
Emne: Re: SV: [Linux-cluster] Linux Clustering Newbe



No I do not but if you elaborate on the tests performed or have atest script I would be glad run some bench marks. 

It would also be helpful to know the hardware you are running on. 


Dan Askew
Sr Systems Administrator
The J. M. Smucker Company

Phone (330)684-3662





"Kristoffer Lippert" <kristoffer.lippert at jppol.dk> 
Sent by: linux-cluster-bounces at redhat.com 

07/31/2007 04:32 AM 
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com> 
cc
Subject
SV: [Linux-cluster] Linux Clustering Newbe

	




Hi Dan, 
  
Have you got any performance figures on the OCFS2 ? 
  
My GFS is currently delivering the following performance: 
(2 nodes, 2gbit FC HBAs multipath'ed + 1 Fujitsu SAN (SX60) 7x250gb raid 5.) 
  
Writes: average 100mb/sec, 200 files/sec (with simple cp/rm) 
Writes: max 150mb/sec, 250files/sec (with simple cp/rm) 
Reads: average 160mb/sec (cat > dev/null - uncached) 
Reads: average 1150mb/sec (cat > dev/null - cached) 
Anybody have a feeling about this performance level? 
  
Regards 
Kristoffer 
  


________________________________

Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Dan.Askew at jmsmucker.com
Sendt: 20. juli 2007 16:19
Til: linux clustering
Emne: [Linux-cluster] Linux Clustering Newbe


Ok Here is what I did to get it working: 

used RHCS to manage the cluster services 
used CLVM  to manage the LVM metadata across cluster 
       Coded a "locking_type = 3" in the LVM.CONF file 

Used OCFS2 to cluster and manage the filesystems across cluster. 

Configuration is 

2 node cluster runing nfs 

vitural IPAddr service to allow clients to mount the drives via NFS 

All appears to be working fine. I can fail over the services manually or even boot a server and the failover works great. 

Now the big question....is this supported by redhat and oracle ...hmmmm 


Thanks to all who helped get this working... 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/15833497/attachment.htm>

From janne.peltonen at helsinki.fi  Tue Jul 31 12:14:38 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Tue, 31 Jul 2007 15:14:38 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070710221922.GG18076@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
Message-ID: <20070731121438.GA21896@helsinki.fi>

On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> On Fri, Jul 06, 2007 at 09:36:59PM +0300, Janne Peltonen wrote:
> > On Fri, Jul 06, 2007 at 02:31:51PM -0400, Lon Hohberger wrote:
> > > > > I forgot what this was... could you just mail me your original email
> > > > > off-list?
> > > > Nevermind, I found it
> > > > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> > > I think it's actually the same problem as the 'status checks' being
> > > wrong in 2.0.24; just a different symptom.
> > > What architecture do you have?  I can build a package for you to test if
> > > you want.
> > 
> > x86_64
> > 
> > Thanks, it'd be nice.
> 
> http://people.redhat.com/lhh/rhel5-test
> 
> You'll need at least the updated cman package.  The -2.1lhh build of
> rgmanager is the one I just built today; the others are a bit older.

Well, I installed the new versions of the cman and rgmanager packages I
found there, but to no avail: I still get 1500 invocations of fs.sh per
second.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Tue Jul 31 13:06:30 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Tue, 31 Jul 2007 16:06:30 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731121438.GA21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
Message-ID: <20070731130630.GC21896@helsinki.fi>

On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > > > > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> > > > I think it's actually the same problem as the 'status checks' being
> > > > wrong in 2.0.24; just a different symptom.
> > > > What architecture do you have?  I can build a package for you to test if
> > > > you want.
> > You'll need at least the updated cman package.  The -2.1lhh build of
> > rgmanager is the one I just built today; the others are a bit older.
> Well, I installed the new versions of the cman and rgmanager packages I
> found there, but to no avail: I still get 1500 invocations of fs.sh per
> second.

To be more exact, I'm not getting 1500 invocations per each second. Some
seconds get no invocations and others get more. They add up to sth like
12000-24000 invocations per minute, though.

The versions of packages:

[jmmpelto at pcn1 resources]$ rpm -qa rgmanager cman
rgmanager-2.0.27-2.1lhh.el5
cman-2.0.69-1.el5

Here's the sequence of invocations per second for one minute today
(after installing the new versions of the packages). They add up to
23277 invocations for that particular minute. I've got 13 services
running on that node, with a total of 26 cluster-controlled file
systems.

second	invocations
0	0
1	0
2	162
3	1229
4	1879
5	804
6	0
7	0
8	0
9	0
10	0
11	0
12	164
13	1299
14	1356
15	212
16	0
17	0
18	0
19	0
20	0
21	0
22	170
23	1531
24	2135
25	773
26	0
27	0
28	0
29	0
30	0
31	0
32	185
33	1225
34	1367
35	326
36	0
37	0
38	0
39	0
40	0
41	0
42	168
43	1171
44	1378
45	371
46	0
47	0
48	0
49	0
50	0
51	0
52	212
53	1327
54	2475
55	1356
56	2
57	0
58	0



-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From lhh at redhat.com  Tue Jul 31 13:41:21 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 31 Jul 2007 09:41:21 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731121438.GA21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
Message-ID: <20070731134121.GH4955@redhat.com>

On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > 
> > http://people.redhat.com/lhh/rhel5-test
> > 
> > You'll need at least the updated cman package.  The -2.1lhh build of
> > rgmanager is the one I just built today; the others are a bit older.
> 
> Well, I installed the new versions of the cman and rgmanager packages I
> found there, but to no avail: I still get 1500 invocations of fs.sh per
> second.

I put a log message in fs.sh:

Jul 31 09:27:29 bart clurgmgrd: [4395]: <err> /usr/share/cluster/fs.sh
TEST 

It comes up once every several (10-20) seconds like it's supposed to. 
What does your cluster.conf look like right now? 

Can you run:

rg_test rules

...and attach the output?  Maybe there's something locale-based that's
causing it to misbehave (but I can't imagine what).

The only way I can imagine this happening is:
(a) for some reason, the intervals for fs.sh checks are being set to 0
(rg_test should see this) *and* 

(b) for some other reason, the status check queueing is occurring
constantly instead of every 10 seconds like it should

I can prevent (a) from happening by not letting it be less than 1; but
that doesn't explain why you're seeing what you are seeing.

I'll be updating packages again probably later today or tomorrow, but
I don't think any of the fixed bugs are related to the fs.sh problem
you're experiencing at this point.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Tue Jul 31 13:49:18 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 31 Jul 2007 09:49:18 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731130630.GC21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731130630.GC21896@helsinki.fi>
Message-ID: <20070731134918.GI4955@redhat.com>

On Tue, Jul 31, 2007 at 04:06:30PM +0300, Janne Peltonen wrote:
> On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > > > > > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> > > > > I think it's actually the same problem as the 'status checks' being
> > > > > wrong in 2.0.24; just a different symptom.
> > > > > What architecture do you have?  I can build a package for you to test if
> > > > > you want.
> > > You'll need at least the updated cman package.  The -2.1lhh build of
> > > rgmanager is the one I just built today; the others are a bit older.
> > Well, I installed the new versions of the cman and rgmanager packages I
> > found there, but to no avail: I still get 1500 invocations of fs.sh per
> > second.
> 
> To be more exact, I'm not getting 1500 invocations per each second. Some
> seconds get no invocations and others get more. They add up to sth like
> 12000-24000 invocations per minute, though.
> 
> The versions of packages:
> 
> [jmmpelto at pcn1 resources]$ rpm -qa rgmanager cman
> rgmanager-2.0.27-2.1lhh.el5
> cman-2.0.69-1.el5
> 
> Here's the sequence of invocations per second for one minute today
> (after installing the new versions of the packages). They add up to
> 23277 invocations for that particular minute. I've got 13 services
> running on that node, with a total of 26 cluster-controlled file
> systems.

You shouldn't be seeing more than ~260 per minute, unless you have a
particularly strange service configuration (and even then, it shouldn't
be 23,000!).

What does your service configuration look like?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Tue Jul 31 13:50:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 31 Jul 2007 09:50:30 -0400
Subject: [Linux-cluster] IP failover
In-Reply-To: <200707301405.12814.pbruna@it-linux.cl>
References: <46ADF86E.6080002@transolutions.net>
	<200707301405.12814.pbruna@it-linux.cl>
Message-ID: <20070731135030.GJ4955@redhat.com>

On Mon, Jul 30, 2007 at 02:05:12PM -0400, Patricio Bruna V. wrote:
> El Monday 30 July 2007 10:40:46 James Wilson escribi?:
> > Is it possibel with rhcs5 to failover an IP without specifying the
> > shared storage. I have 2 servers being replicated by drbd
> > primary/primary and when one fails I need the ip address to switch to
> > the other server.
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> Yes, The IP would be the clustered service.

Like this:

	<service name="an_ip_address">
		<ip address="10.2.3.4"/>
	</service>

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From simanhew at gmail.com  Tue Jul 31 14:22:33 2007
From: simanhew at gmail.com (siman hew)
Date: Tue, 31 Jul 2007 10:22:33 -0400
Subject: [Linux-cluster] clustat on GULM
Message-ID: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com>

I found clustat report wrong information about rgmanager when a cluster is
GULM.
I setup a 3-node cluster on RHEL4U5, with GULM.  There are one failover
domain, one resource and one service, just for testing.
I found clustat always report rgmanger is running after cluster is started.
With the configuration with DLM, clustat report correctly.
node4 is lock server, and all nodes report the same inaccurate information.

Here is commands and report:

node4:~# service ccsd start
Starting ccsd:                                             [  OK  ]
node4:~# service lock_gulmd start
Starting lock_gulmd:                                       [  OK  ]
node4:~# clustat
msg_open: No route to host
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
   node4                                    Online, Local, rgmanager
   node3                                    Online, rgmanager
   node2                                    Online, rgmanager

node4:~# clustat -x
msg_open: No route to host
<?xml version="1.0"?>
<clustat version="4.1.1">
  <quorum quorate="1" groupmember="1"/>
  <nodes>
     <node name="node4" state="1" local="1" estranged="0"
rgmanager="1" nodeid="0xffff0000540610ac"/>
     <node name="node3" state="1" local="0" estranged="0"
rgmanager="1" nodeid="0xffff0000530610ac"/>
     <node name="node2" state="1" local="0" estranged="0"
rgmanager="1" nodeid="0xffff0000520610ac"/>
  </nodes>
</clustat>
node4:~# service rgmanager status
clurgmgrd is stopped
node4:~# service rgmanager start
Starting Cluster Service Manager:                          [  OK  ]
node4:~# service rgmanager status
clurgmgrd (pid 17151 17150) is running...
node4:~# clustat -x
<?xml version="1.0"?>
<clustat version="4.1.1">
  <quorum quorate="1" groupmember="1"/>
  <nodes>
     <node name="node4" state="1" local="1" estranged="0"
rgmanager="1" nodeid="0xffff0000540610ac"/>
     <node name="node3" state="1" local="0" estranged="0"
rgmanager="1" nodeid="0xffff0000530610ac"/>
     <node name="node2" state="1" local="0" estranged="0"
rgmanager="1" nodeid="0xffff0000520610ac"/>
  </nodes>
  <groups>
    <group name="s1" state="119" state_str="disabled"  owner="none"
last_owner="none" restarts="0" last_transition="0"
last_transition_str="Wed Dec 31 19:00:00 1969"/>
  </groups>
</clustat>

-----------------------------
Please note, first/seond clustat actually report an error: msg_open:
No route to host
When the cluster is DLM, there is another msg printed:
Resource Group Manager not running; no servie information available.

check rgmanager status, it is stopped.  but report said rgmanager
running on all nodes, and groupmember is set on.

is it supposed to be that?
Any hint are very welcome.

Thanks,

Siman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/1c9e46c8/attachment.htm>

From jparsons at redhat.com  Tue Jul 31 14:36:36 2007
From: jparsons at redhat.com (James Parsons)
Date: Tue, 31 Jul 2007 10:36:36 -0400
Subject: [Linux-cluster] fence_apc failure
In-Reply-To: <46AE7383.6090706@cmiware.com>
References: <46AE6DC3.1060807@cmiware.com> <46AE7383.6090706@cmiware.com>
Message-ID: <46AF48F4.8090701@redhat.com>

Chris,

If you run /sbin/fence_apc -V, and the response is "fence_apc 1.32.45", 
then you are using the latest
apc agent that has been updated to handle the 3.x apc firmware.

If you are running the latest agent, then you are obviously encountering 
a problem. It is possible that it has something to do with the asterisk 
that is appearing after an outlet group. Please let me know if you are 
running 1.32.45...this should be able to be fixed today, if you can help 
with some testing.

Some background on the fence_apc agent for those who care:
For years, the apc agent was written in perl. As the apc firmware 
evolved from release to release, the regexp match patterns for screen 
scraping the telnet session grew uglier and uglier. In addition, the 
perl agent did not support outlet naming or grouping, and did not handle 
the larger switches that took 2 or more screens to list out the 
available outlets. To add these features and to make the agent easier to 
maintain, it was decided to rewrite the agent in python. This was done 
while apc firmware was still in the 2.x series.

After some errors with MasterSwitchPlus switches were fixed, the agent 
worked well in the field...until version 3.x firmware was released. This 
firmware release changed ALOT of things, including screen order.

The current agent version worked well on my older apc switches as well 
as the ones with newer firmware, so I released it into the beta and plan 
on releasing it as an async errata release for RHEL4.5 Cluster Suite.

I will try and reproduce this issue this morning and have something for 
you to test today.

By the way, there is also a fence_apc_snmp agent. It works great, but we 
have not switched to using it exclusively yet, because some admins don't 
like having snmp packages on their systems - but with the pain that 
trying to maintain the telnet version of this agent is causing, it is 
making me lean more and more towards including just one apc solution - 
snmp. :-/

Thanks for your patience,

-Jim

Chris Harms wrote:

> This appears to be a "no."
> fence_node[26827]: agent "fence_apc" reports: Power Off 
> unsuccessfulStatus check successful. Port 4 is ON
>
> However the node was powered off so something worked.  Also, it 
> appears to have sent an Off command instead of Reboot.
>
>
> Chris Harms wrote:
>
>> We now have some APC 7931 units at our disposal, however the 
>> fence_apc perl script fails with "unrecognized menu response."  This 
>> appears to be from new firmware on the APC units.  There is a python 
>> script in CVS that looks like it may operate correctly with the new 
>> firmware menus, is this correct?  Also, is it a drop-in replacement 
>> for the perl script, i.e. will saving it as /sbin/fence_apc work with 
>> RHCS 5?
>>
>> Thanks,
>> Chris
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster





From jbrassow at redhat.com  Tue Jul 31 14:47:34 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Tue, 31 Jul 2007 09:47:34 -0500
Subject: [Linux-cluster] lvremove error
In-Reply-To: <1185821897.3225.8.camel@localhost.localdomain>
References: <OFC7A60F58.EA204473-ON85257326.006E3290-85257326.006E3EE4@csc.com>
	<87363102-11BD-4A9D-9703-95FF0BED0A9E@redhat.com>
	<1185821290.3225.5.camel@localhost.localdomain>
	<1185821897.3225.8.camel@localhost.localdomain>
Message-ID: <D85C9B77-492C-441C-BA2D-EE590E154210@redhat.com>

Hmmm, yes.  I think this is a bug in the way the CLI handles cluster  
volumes vs. single machine volumes.

On a single machine, if the volume is not in-use (active but not  
mounted), an lvremove will deactivate and remove it.

In a cluster, the lvremove seems to skip the deactivate portion of  
that, forcing you to do:
$> lvchange -an ...
$> lvremove ...
to get similar effect.

The GUI seems to be hiding this "bug" from the end-user.

  brassow

On Jul 30, 2007, at 1:58 PM, jim parsons wrote:

>
> Follow-up:
>
>  luci just calls the same command line command on the ricci agent.
> ricci does  lvchange -an which will error out if the volume is in  
> use by
> other cluster members
> We just confirmed this behavior on a test cluster.
>
> -J
>
> On Mon, 2007-07-30 at 14:48 -0400, jim parsons wrote:
>> Thanks for this notice, Jonathan; I will check if this is in error...
>>
>> -j
>>
>> On Mon, 2007-07-30 at 13:16 -0500, Jonathan Brassow wrote:
>>> lvremove doesn't seem like it wants to remove the volume because  
>>> it is
>>> in use on another node.  Is this true?  If so, the GUI shouldn't be
>>> removing it either.
>>>
>>>
>>>  brassow
>>>
>>> On Jul 28, 2007, at 3:04 PM, Mahmoud Hanafi wrote:
>>>
>>>>
>>>>
>>>> I am trying to remove a lv via cmd line but I get the following
>>>> error.
>>>>
>>>> [root at nfs1 /]# lvremove /dev/gfs_vg/gfs_st2_sz512
>>>> Do you really want to remove active logical volume "gfs_st2_sz512"?
>>>> [y/n]: y
>>>>   Error locking on node nfs1.local: Volume is busy on another node
>>>>   Can't get exclusive access to volume "gfs_st2_sz512"
>>>>
>>>> But I can remove lv via the GUI (luci).
>>>>
>>>> What am I doing wrong on the cmd line.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Mahmoud
>>>>
>>>> ------------------------------------------------------------------- 
>>>> ------------------------------------------------------------------- 
>>>> ------------------------------------------
>>>> This is a PRIVATE message. If you are not the intended recipient,
>>>> please delete without copying and kindly advise us by e-mail of the
>>>> mistake in delivery. NOTE: Regardless of content, this e-mail shall
>>>> not operate to bind CSC to any order or other contract unless
>>>> pursuant to explicit written agreement or government initiative
>>>> expressly permitting the use of e-mail for such purpose.
>>>> ------------------------------------------------------------------- 
>>>> ------------------------------------------------------------------- 
>>>> ------------------------------------------
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From janne.peltonen at helsinki.fi  Tue Jul 31 14:54:41 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Tue, 31 Jul 2007 17:54:41 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731134121.GH4955@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
Message-ID: <20070731145441.GE21896@helsinki.fi>

On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote:
> On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > > 
> > > http://people.redhat.com/lhh/rhel5-test
> > > 
> > > You'll need at least the updated cman package.  The -2.1lhh build of
> > > rgmanager is the one I just built today; the others are a bit older.
> > 
> > Well, I installed the new versions of the cman and rgmanager packages I
> > found there, but to no avail: I still get 1500 invocations of fs.sh per
> > second.
> 
> I put a log message in fs.sh:
> 
> Jul 31 09:27:29 bart clurgmgrd: [4395]: <err> /usr/share/cluster/fs.sh
> TEST 
> 
> It comes up once every several (10-20) seconds like it's supposed to. 

I did the same, with the same results. It seems to me that the clurgmgrd
process isn't calling the complete script any more times than it's
supposed to.  What I'm seeing are the execs of fs.sh, that is, it
includes each () and `` and so on. Each fs.sh invocation seems to create
quite an amount of subshells.

I'm sorry for having misled you. And this all means, there isn't
probably much reason to read the cluster.conf and rg_test rules output -
I'll attach them anyway.

> What does your cluster.conf look like right now? 

Like this:

--clip--
<?xml version="1.0"?>
<cluster config_version="45" name="mappi-primary">
	<fence_daemon post_fail_delay="0" post_join_delay="30"/>
	<clusternodes>
		<clusternode name="pcnm-hb" nodeid="100" votes="1">
			<fence>
				<method name="1">
					<device name="pcnm-fence-ilo"/>
				</method>
				<method name="2">
					<device name="pcnm-manual" nodename="pcnm-hb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="pcn1-hb" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="pcn1-fence-ilo"/>
				</method>
				<method name="2">
					<device name="pcn1-manual" nodename="pcn3-hb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="pcn2-hb" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="pcn2-fence-ilo"/>
				</method>
				<method name="2">
					<device name="pcn2-manual" nodename="pcn3-hb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="pcn3-hb" nodeid="3" votes="1">
			<fence>
				<method name="1">
					<device name="pcn3-fence-ilo"/>
				</method>
				<method name="2">
					<device name="pcn3-manual" nodename="pcn3-hb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="pcn4-hb" nodeid="4" votes="1">
			<fence>
				<method name="1">
					<device name="pcn4-fence-ilo"/>
				</method>
				<method name="2">
					<device name="pcn4-manual" nodename="pcn4-hb"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman/>
	<fencedevices>
		<fencedevice agent="fence_ilo" hostname="pcnm-iloin" login="Administrator" name="pcnm-fence-ilo" passwd="<cencored>"/>
		<fencedevice agent="fence_ilo" hostname="pcn1-iloin" login="Administrator" name="pcn1-fence-ilo" passwd="<cencored>"/>
		<fencedevice agent="fence_ilo" hostname="pcn2-iloin" login="Administrator" name="pcn2-fence-ilo" passwd="<cencored>"/>
		<fencedevice agent="fence_ilo" hostname="pcn3-iloin" login="Administrator" name="pcn3-fence-ilo" passwd="<cencored>"/>
		<fencedevice agent="fence_ilo" hostname="pcn4-iloin" login="Administrator" name="pcn4-fence-ilo" passwd="<cencored>"/>
		<fencedevice agent="fence_manual" name="pcnm-manual"/>
		<fencedevice agent="fence_manual" name="pcn1-manual"/>
		<fencedevice agent="fence_manual" name="pcn2-manual"/>
		<fencedevice agent="fence_manual" name="pcn3-manual"/>
		<fencedevice agent="fence_manual" name="pcn4-manual"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="im" restricted="1">
				<failoverdomainnode name="pcnm-hb"/>
			</failoverdomain>
			<failoverdomain name="akk" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="bio" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="biz" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="edu" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="el" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="hal" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="hum" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="ii" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="lib" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="medi" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="ml" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="mm" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="oik" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="teol" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="valt" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p01" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p02" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p03" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p04" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p05" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p06" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p07" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p08" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p09" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p10" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p11" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="p12" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i01" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i02" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i03" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i04" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i05" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i06" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i07" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i08" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i09" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i10" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i11" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i12" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i13" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i14" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i15" ordered="1" restricted="1">
				<failoverdomainnode name="pcn1-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i16" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i17" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn4-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i18" ordered="1" restricted="1">
				<failoverdomainnode name="pcn2-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn3-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i19" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i20" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn4-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i21" ordered="1" restricted="1">
				<failoverdomainnode name="pcn3-hb" priority="1"/>
				<failoverdomainnode name="pcn4-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn2-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i22" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn1-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i23" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn2-hb" priority="2"/>
				<failoverdomainnode name="pcn1-hb" priority="3"/>
				<failoverdomainnode name="pcn3-hb" priority="4"/>
			</failoverdomain>
			<failoverdomain name="i24" ordered="1" restricted="1">
				<failoverdomainnode name="pcn4-hb" priority="1"/>
				<failoverdomainnode name="pcn3-hb" priority="2"/>
				<failoverdomainnode name="pcn2-hb" priority="3"/>
				<failoverdomainnode name="pcn1-hb" priority="4"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="128.214.20.163" monitor_link="1"/>
			<ip address="128.214.20.139" monitor_link="1"/>
			<ip address="128.214.20.140" monitor_link="1"/>
			<ip address="128.214.20.141" monitor_link="1"/>
			<ip address="128.214.20.142" monitor_link="1"/>
			<ip address="128.214.20.143" monitor_link="1"/>
			<ip address="128.214.20.144" monitor_link="1"/>
			<ip address="128.214.20.145" monitor_link="1"/>
			<ip address="128.214.20.146" monitor_link="1"/>
			<ip address="128.214.20.147" monitor_link="1"/>
			<ip address="128.214.20.148" monitor_link="1"/>
			<ip address="128.214.20.149" monitor_link="1"/>
			<ip address="128.214.20.150" monitor_link="1"/>
			<ip address="128.214.20.165" monitor_link="1"/>
			<ip address="128.214.20.166" monitor_link="1"/>
			<ip address="128.214.20.167" monitor_link="1"/>
			<ip address="128.214.20.168" monitor_link="1"/>
			<ip address="128.214.20.169" monitor_link="1"/>
			<ip address="128.214.20.170" monitor_link="1"/>
			<ip address="128.214.20.171" monitor_link="1"/>
			<ip address="128.214.20.172" monitor_link="1"/>
			<ip address="128.214.20.173" monitor_link="1"/>
			<ip address="128.214.20.174" monitor_link="1"/>
			<ip address="128.214.20.175" monitor_link="1"/>
			<ip address="128.214.20.176" monitor_link="1"/>
			<ip address="128.214.20.189" monitor_link="1"/>
			<ip address="128.214.20.193" monitor_link="1"/>
			<ip address="128.214.20.194" monitor_link="1"/>
			<ip address="128.214.20.195" monitor_link="1"/>
			<ip address="128.214.20.196" monitor_link="1"/>
			<ip address="128.214.20.197" monitor_link="1"/>
			<ip address="128.214.20.198" monitor_link="1"/>
			<ip address="128.214.20.199" monitor_link="1"/>
			<ip address="128.214.20.200" monitor_link="1"/>
			<ip address="128.214.20.201" monitor_link="1"/>
			<ip address="128.214.20.202" monitor_link="1"/>
			<ip address="128.214.20.203" monitor_link="1"/>
			<ip address="128.214.20.204" monitor_link="1"/>
			<ip address="128.214.20.205" monitor_link="1"/>
			<ip address="128.214.20.206" monitor_link="1"/>
			<ip address="128.214.20.207" monitor_link="1"/>
			<ip address="128.214.20.208" monitor_link="1"/>
			<ip address="128.214.20.209" monitor_link="1"/>
			<ip address="128.214.20.210" monitor_link="1"/>
			<ip address="128.214.20.211" monitor_link="1"/>
			<ip address="128.214.20.212" monitor_link="1"/>
			<ip address="128.214.20.213" monitor_link="1"/>
			<ip address="128.214.20.214" monitor_link="1"/>
			<ip address="128.214.20.215" monitor_link="1"/>
			<ip address="128.214.20.216" monitor_link="1"/>
			<ip address="128.214.20.217" monitor_link="1"/>
			<ip address="128.214.20.218" monitor_link="1"/>
			<fs device="/dev/mappi-primary/akk-config" force_fsck="0" force_unmount="1" fsid="400001" fstype="ext3" mountpoint="/var/lib/imap/akk" name="akk-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/bio-config" force_fsck="0" force_unmount="1" fsid="400002" fstype="ext3" mountpoint="/var/lib/imap/bio" name="bio-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/biz-config" force_fsck="0" force_unmount="1" fsid="400003" fstype="ext3" mountpoint="/var/lib/imap/biz" name="biz-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/edu-config" force_fsck="0" force_unmount="1" fsid="400004" fstype="ext3" mountpoint="/var/lib/imap/edu" name="edu-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/el-config" force_fsck="0" force_unmount="1" fsid="400005" fstype="ext3" mountpoint="/var/lib/imap/el" name="el-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/hal-config" force_fsck="0" force_unmount="1" fsid="400006" fstype="ext3" mountpoint="/var/lib/imap/hal" name="hal-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/hum-config" force_fsck="0" force_unmount="1" fsid="400007" fstype="ext3" mountpoint="/var/lib/imap/hum" name="hum-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ii-config" force_fsck="0" force_unmount="1" fsid="400008" fstype="ext3" mountpoint="/var/lib/imap/ii" name="ii-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/lib-config" force_fsck="0" force_unmount="1" fsid="400009" fstype="ext3" mountpoint="/var/lib/imap/lib" name="lib-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/medi-config" force_fsck="0" force_unmount="1" fsid="400010" fstype="ext3" mountpoint="/var/lib/imap/medi" name="medi-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ml-config" force_fsck="0" force_unmount="1" fsid="400011" fstype="ext3" mountpoint="/var/lib/imap/ml" name="ml-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/mm-config" force_fsck="0" force_unmount="1" fsid="400012" fstype="ext3" mountpoint="/var/lib/imap/mm" name="mm-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/oik-config" force_fsck="0" force_unmount="1" fsid="400013" fstype="ext3" mountpoint="/var/lib/imap/oik" name="oik-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/teol-config" force_fsck="0" force_unmount="1" fsid="400014" fstype="ext3" mountpoint="/var/lib/imap/teol" name="teol-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/valt-config" force_fsck="0" force_unmount="1" fsid="400015" fstype="ext3" mountpoint="/var/lib/imap/valt" name="valt-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/akk-default" force_fsck="0" force_unmount="1" fsid="500001" fstype="ext3" mountpoint="/var/spool/imap/akk" name="akk-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/bio-default" force_fsck="0" force_unmount="1" fsid="500002" fstype="ext3" mountpoint="/var/spool/imap/bio" name="bio-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/biz-default" force_fsck="0" force_unmount="1" fsid="500003" fstype="ext3" mountpoint="/var/spool/imap/biz" name="biz-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/edu-default" force_fsck="0" force_unmount="1" fsid="500004" fstype="ext3" mountpoint="/var/spool/imap/edu" name="edu-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/el-default" force_fsck="0" force_unmount="1" fsid="500005" fstype="ext3" mountpoint="/var/spool/imap/el" name="el-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/hal-default" force_fsck="0" force_unmount="1" fsid="500006" fstype="ext3" mountpoint="/var/spool/imap/hal" name="hal-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/hum-default" force_fsck="0" force_unmount="1" fsid="500007" fstype="ext3" mountpoint="/var/spool/imap/hum" name="hum-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ii-default" force_fsck="0" force_unmount="1" fsid="500008" fstype="ext3" mountpoint="/var/spool/imap/ii" name="ii-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/lib-default" force_fsck="0" force_unmount="1" fsid="500009" fstype="ext3" mountpoint="/var/spool/imap/lib" name="lib-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/medi-default" force_fsck="0" force_unmount="1" fsid="500010" fstype="ext3" mountpoint="/var/spool/imap/medi" name="medi-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ml-default" force_fsck="0" force_unmount="1" fsid="500011" fstype="ext3" mountpoint="/var/spool/imap/ml" name="ml-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/mm-default" force_fsck="0" force_unmount="1" fsid="500012" fstype="ext3" mountpoint="/var/spool/imap/mm" name="mm-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/oik-default" force_fsck="0" force_unmount="1" fsid="500013" fstype="ext3" mountpoint="/var/spool/imap/oik" name="oik-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/teol-default" force_fsck="0" force_unmount="1" fsid="500014" fstype="ext3" mountpoint="/var/spool/imap/teol" name="teol-default" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/valt-default" force_fsck="0" force_unmount="1" fsid="500015" fstype="ext3" mountpoint="/var/spool/imap/valt" name="valt-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p01-default" force_fsck="0" force_unmount="1" fsid="800001" fstype="ext3" mountpoint="/var/spool/imap/p01" name="p01-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p02-default" force_fsck="0" force_unmount="1" fsid="800002" fstype="ext3" mountpoint="/var/spool/imap/p02" name="p02-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p03-default" force_fsck="0" force_unmount="1" fsid="800003" fstype="ext3" mountpoint="/var/spool/imap/p03" name="p03-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p04-default" force_fsck="0" force_unmount="1" fsid="800004" fstype="ext3" mountpoint="/var/spool/imap/p04" name="p04-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p05-default" force_fsck="0" force_unmount="1" fsid="800005" fstype="ext3" mountpoint="/var/spool/imap/p05" name="p05-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p06-default" force_fsck="0" force_unmount="1" fsid="800006" fstype="ext3" mountpoint="/var/spool/imap/p06" name="p06-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p07-default" force_fsck="0" force_unmount="1" fsid="800007" fstype="ext3" mountpoint="/var/spool/imap/p07" name="p07-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p08-default" force_fsck="0" force_unmount="1" fsid="800008" fstype="ext3" mountpoint="/var/spool/imap/p08" name="p08-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p09-default" force_fsck="0" force_unmount="1" fsid="800009" fstype="ext3" mountpoint="/var/spool/imap/p09" name="p09-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p10-default" force_fsck="0" force_unmount="1" fsid="800010" fstype="ext3" mountpoint="/var/spool/imap/p10" name="p10-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p11-default" force_fsck="0" force_unmount="1" fsid="800011" fstype="ext3" mountpoint="/var/spool/imap/p11" name="p11-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p12-default" force_fsck="0" force_unmount="1" fsid="800012" fstype="ext3" mountpoint="/var/spool/imap/p12" name="p12-default" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p01-config" force_fsck="0" force_unmount="1" fsid="900001" fstype="ext3" mountpoint="/var/lib/imap/p01" name="p01-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p02-config" force_fsck="0" force_unmount="1" fsid="900002" fstype="ext3" mountpoint="/var/lib/imap/p02" name="p02-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p03-config" force_fsck="0" force_unmount="1" fsid="900003" fstype="ext3" mountpoint="/var/lib/imap/p03" name="p03-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p04-config" force_fsck="0" force_unmount="1" fsid="900004" fstype="ext3" mountpoint="/var/lib/imap/p04" name="p04-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p05-config" force_fsck="0" force_unmount="1" fsid="900005" fstype="ext3" mountpoint="/var/lib/imap/p05" name="p05-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p06-config" force_fsck="0" force_unmount="1" fsid="900006" fstype="ext3" mountpoint="/var/lib/imap/p06" name="p06-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p07-config" force_fsck="0" force_unmount="1" fsid="900007" fstype="ext3" mountpoint="/var/lib/imap/p07" name="p07-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p08-config" force_fsck="0" force_unmount="1" fsid="900008" fstype="ext3" mountpoint="/var/lib/imap/p08" name="p08-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p09-config" force_fsck="0" force_unmount="1" fsid="900009" fstype="ext3" mountpoint="/var/lib/imap/p09" name="p09-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p10-config" force_fsck="0" force_unmount="1" fsid="900010" fstype="ext3" mountpoint="/var/lib/imap/p10" name="p10-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p11-config" force_fsck="0" force_unmount="1" fsid="900011" fstype="ext3" mountpoint="/var/lib/imap/p11" name="p11-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/p12-config" force_fsck="0" force_unmount="1" fsid="900012" fstype="ext3" mountpoint="/var/lib/imap/p12" name="p12-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i01-config" force_fsck="0" force_unmount="1" fsid="100001" fstype="ext3" mountpoint="/var/lib/imap/i01" name="i01-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i02-config" force_fsck="0" force_unmount="1" fsid="100002" fstype="ext3" mountpoint="/var/lib/imap/i02" name="i02-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i03-config" force_fsck="0" force_unmount="1" fsid="100003" fstype="ext3" mountpoint="/var/lib/imap/i03" name="i03-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i04-config" force_fsck="0" force_unmount="1" fsid="100004" fstype="ext3" mountpoint="/var/lib/imap/i04" name="i04-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i05-config" force_fsck="0" force_unmount="1" fsid="100005" fstype="ext3" mountpoint="/var/lib/imap/i05" name="i05-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i06-config" force_fsck="0" force_unmount="1" fsid="100006" fstype="ext3" mountpoint="/var/lib/imap/i06" name="i06-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i07-config" force_fsck="0" force_unmount="1" fsid="100007" fstype="ext3" mountpoint="/var/lib/imap/i07" name="i07-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i08-config" force_fsck="0" force_unmount="1" fsid="100008" fstype="ext3" mountpoint="/var/lib/imap/i08" name="i08-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i09-config" force_fsck="0" force_unmount="1" fsid="100009" fstype="ext3" mountpoint="/var/lib/imap/i09" name="i09-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i10-config" force_fsck="0" force_unmount="1" fsid="100010" fstype="ext3" mountpoint="/var/lib/imap/i10" name="i10-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i11-config" force_fsck="0" force_unmount="1" fsid="100011" fstype="ext3" mountpoint="/var/lib/imap/i11" name="i11-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i12-config" force_fsck="0" force_unmount="1" fsid="100012" fstype="ext3" mountpoint="/var/lib/imap/i12" name="i12-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i13-config" force_fsck="0" force_unmount="1" fsid="100013" fstype="ext3" mountpoint="/var/lib/imap/i13" name="i13-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i14-config" force_fsck="0" force_unmount="1" fsid="100014" fstype="ext3" mountpoint="/var/lib/imap/i14" name="i14-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i15-config" force_fsck="0" force_unmount="1" fsid="100015" fstype="ext3" mountpoint="/var/lib/imap/i15" name="i15-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i16-config" force_fsck="0" force_unmount="1" fsid="100016" fstype="ext3" mountpoint="/var/lib/imap/i16" name="i16-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i17-config" force_fsck="0" force_unmount="1" fsid="100017" fstype="ext3" mountpoint="/var/lib/imap/i17" name="i17-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i18-config" force_fsck="0" force_unmount="1" fsid="100018" fstype="ext3" mountpoint="/var/lib/imap/i18" name="i18-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i19-config" force_fsck="0" force_unmount="1" fsid="100019" fstype="ext3" mountpoint="/var/lib/imap/i19" name="i19-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i20-config" force_fsck="0" force_unmount="1" fsid="100020" fstype="ext3" mountpoint="/var/lib/imap/i20" name="i20-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i21-config" force_fsck="0" force_unmount="1" fsid="100021" fstype="ext3" mountpoint="/var/lib/imap/i21" name="i21-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i22-config" force_fsck="0" force_unmount="1" fsid="100022" fstype="ext3" mountpoint="/var/lib/imap/i22" name="i22-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i23-config" force_fsck="0" force_unmount="1" fsid="100023" fstype="ext3" mountpoint="/var/lib/imap/i23" name="i23-config" options="" self_fence="0"/>
 			<fs device="/dev/mappi-primary/i24-config" force_fsck="0" force_unmount="1" fsid="100024" fstype="ext3" mountpoint="/var/lib/imap/i24" name="i24-config" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/aaal" force_fsck="0" force_unmount="1" fsid="300001" fstype="ext3" mountpoint="/var/spool/imap/aaal" name="aaal" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/amav" force_fsck="0" force_unmount="1" fsid="300002" fstype="ext3" mountpoint="/var/spool/imap/amav" name="amav" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/awei" force_fsck="0" force_unmount="1" fsid="300003" fstype="ext3" mountpoint="/var/spool/imap/awei" name="awei" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ejga" force_fsck="0" force_unmount="1" fsid="300004" fstype="ext3" mountpoint="/var/spool/imap/ejga" name="ejga" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/gbhk" force_fsck="0" force_unmount="1" fsid="300005" fstype="ext3" mountpoint="/var/spool/imap/gbhk" name="gbhk" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/hlhz" force_fsck="0" force_unmount="1" fsid="300006" fstype="ext3" mountpoint="/var/spool/imap/hlhz" name="hlhz" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/iaji" force_fsck="0" force_unmount="1" fsid="300007" fstype="ext3" mountpoint="/var/spool/imap/iaji" name="iaji" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/jjjr" force_fsck="0" force_unmount="1" fsid="300008" fstype="ext3" mountpoint="/var/spool/imap/jjjr" name="jjjr" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/jskd" force_fsck="0" force_unmount="1" fsid="300009" fstype="ext3" mountpoint="/var/spool/imap/jskd" name="jskd" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/keko" force_fsck="0" force_unmount="1" fsid="300010" fstype="ext3" mountpoint="/var/spool/imap/keko" name="keko" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/kplh" force_fsck="0" force_unmount="1" fsid="300011" fstype="ext3" mountpoint="/var/spool/imap/kplh" name="kplh" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/lilz" force_fsck="0" force_unmount="1" fsid="300012" fstype="ext3" mountpoint="/var/spool/imap/lilz" name="lilz" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/mami" force_fsck="0" force_unmount="1" fsid="300013" fstype="ext3" mountpoint="/var/spool/imap/mami" name="mami" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/mjmp" force_fsck="0" force_unmount="1" fsid="300014" fstype="ext3" mountpoint="/var/spool/imap/mjmp" name="mjmp" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/mqnj" force_fsck="0" force_unmount="1" fsid="300015" fstype="ext3" mountpoint="/var/spool/imap/mqnj" name="mqnj" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/nkpd" force_fsck="0" force_unmount="1" fsid="300016" fstype="ext3" mountpoint="/var/spool/imap/nkpd" name="nkpd" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/peps" force_fsck="0" force_unmount="1" fsid="300017" fstype="ext3" mountpoint="/var/spool/imap/peps" name="peps" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ptro" force_fsck="0" force_unmount="1" fsid="300018" fstype="ext3" mountpoint="/var/spool/imap/ptro" name="ptro" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/rpsi" force_fsck="0" force_unmount="1" fsid="300019" fstype="ext3" mountpoint="/var/spool/imap/rpsi" name="rpsi" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/sjss" force_fsck="0" force_unmount="1" fsid="300020" fstype="ext3" mountpoint="/var/spool/imap/sjss" name="sjss" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/stth" force_fsck="0" force_unmount="1" fsid="300021" fstype="ext3" mountpoint="/var/spool/imap/stth" name="stth" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/tits" force_fsck="0" force_unmount="1" fsid="300022" fstype="ext3" mountpoint="/var/spool/imap/tits" name="tits" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/ttvi" force_fsck="0" force_unmount="1" fsid="300023" fstype="ext3" mountpoint="/var/spool/imap/ttvi" name="ttvi" options="" self_fence="0"/>
			<fs device="/dev/mappi-primary/vjzz" force_fsck="0" force_unmount="1" fsid="300024" fstype="ext3" mountpoint="/var/spool/imap/vjzz" name="vjzz" options="" self_fence="0"/>
		</resources>
		<service autostart="1" domain="akk" name="akk" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.akk.oldmappi" name="initd.akk.oldmappi"/>
			<ip ref="128.214.20.189"/>
			<fs ref="akk-config"/>
			<fs ref="akk-default"/>
		</service>
		<service autostart="1" domain="bio" name="bio" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.bio.oldmappi" name="initd.bio.oldmappi"/>
			<ip ref="128.214.20.193"/>
			<fs ref="bio-config"/>
			<fs ref="bio-default"/>
		</service>
		<service autostart="1" domain="biz" name="biz" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.biz.oldmappi" name="initd.biz.oldmappi"/>
			<ip ref="128.214.20.194"/>
			<fs ref="biz-config"/>
			<fs ref="biz-default"/>
		</service>
		<service autostart="1" domain="edu" name="edu" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.edu.oldmappi" name="initd.edu.oldmappi"/>
			<ip ref="128.214.20.195"/>
			<fs ref="edu-config"/>
			<fs ref="edu-default"/>
		</service>
		<service autostart="1" domain="el" name="el" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.el.oldmappi" name="initd.el.oldmappi"/>
			<ip ref="128.214.20.196"/>
			<fs ref="el-config"/>
			<fs ref="el-default"/>
		</service>
		<service autostart="1" domain="hal" name="hal" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.hal.oldmappi" name="initd.hal.oldmappi"/>
			<ip ref="128.214.20.197"/>
			<fs ref="hal-config"/>
			<fs ref="hal-default"/>
		</service>
		<service autostart="1" domain="hum" name="hum" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.hum.oldmappi" name="initd.hum.oldmappi"/>
			<ip ref="128.214.20.198"/>
			<fs ref="hum-config"/>
			<fs ref="hum-default"/>
		</service>
		<service autostart="1" domain="ii" name="ii" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.ii.oldmappi" name="initd.ii.oldmappi"/>
			<ip ref="128.214.20.199"/>
			<fs ref="ii-config"/>
			<fs ref="ii-default"/>
		</service>
		<service autostart="1" domain="lib" name="lib" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.lib.oldmappi" name="initd.lib.oldmappi"/>
			<ip ref="128.214.20.200"/>
			<fs ref="lib-config"/>
			<fs ref="lib-default"/>
		</service>
		<service autostart="1" domain="medi" name="medi" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.medi.oldmappi" name="initd.medi.oldmappi"/>
			<ip ref="128.214.20.201"/>
			<fs ref="medi-config"/>
			<fs ref="medi-default"/>
		</service>
		<service autostart="1" domain="ml" name="ml" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.ml.oldmappi" name="initd.ml.oldmappi"/>
			<ip ref="128.214.20.202"/>
			<fs ref="ml-config"/>
			<fs ref="ml-default"/>
		</service>
		<service autostart="1" domain="mm" name="mm" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.mm.oldmappi" name="initd.mm.oldmappi"/>
			<ip ref="128.214.20.203"/>
			<fs ref="mm-config"/>
			<fs ref="mm-default"/>
		</service>
		<service autostart="1" domain="oik" name="oik" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.oik.oldmappi" name="initd.oik.oldmappi"/>
			<ip ref="128.214.20.204"/>
			<fs ref="oik-config"/>
			<fs ref="oik-default"/>
		</service>
		<service autostart="1" domain="teol" name="teol" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.teol.oldmappi" name="initd.teol.oldmappi"/>
			<ip ref="128.214.20.205"/>
			<fs ref="teol-config"/>
			<fs ref="teol-default"/>
		</service>
		<service autostart="1" domain="valt" name="valt" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.valt.oldmappi" name="initd.valt.oldmappi"/>
			<ip ref="128.214.20.206"/>
			<fs ref="valt-config"/>
			<fs ref="valt-default"/>
		</service>
		<service autostart="1" domain="p01" name="p01" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p01.oldmappi" name="initd.p01.oldmappi"/>
			<ip ref="128.214.20.207"/>
			<fs ref="p01-config"/>
			<fs ref="p01-default"/>
		</service>
		<service autostart="1" domain="p02" name="p02" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p02.oldmappi" name="initd.p02.oldmappi"/>
			<ip ref="128.214.20.208"/>
			<fs ref="p02-config"/>
			<fs ref="p02-default"/>
		</service>
		<service autostart="1" domain="p03" name="p03" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p03.oldmappi" name="initd.p03.oldmappi"/>
			<ip ref="128.214.20.209"/>
			<fs ref="p03-config"/>
			<fs ref="p03-default"/>
		</service>
		<service autostart="1" domain="p04" name="p04" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p04.oldmappi" name="initd.p04.oldmappi"/>
			<ip ref="128.214.20.210"/>
			<fs ref="p04-config"/>
			<fs ref="p04-default"/>
		</service>
		<service autostart="1" domain="p05" name="p05" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p05.oldmappi" name="initd.p05.oldmappi"/>
			<ip ref="128.214.20.211"/>
			<fs ref="p05-config"/>
			<fs ref="p05-default"/>
		</service>
		<service autostart="1" domain="p06" name="p06" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p06.oldmappi" name="initd.p06.oldmappi"/>
			<ip ref="128.214.20.212"/>
			<fs ref="p06-config"/>
			<fs ref="p06-default"/>
		</service>
		<service autostart="1" domain="p07" name="p07" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p07.oldmappi" name="initd.p07.oldmappi"/>
			<ip ref="128.214.20.213"/>
			<fs ref="p07-config"/>
			<fs ref="p07-default"/>
		</service>
		<service autostart="1" domain="p08" name="p08" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p08.oldmappi" name="initd.p08.oldmappi"/>
			<ip ref="128.214.20.214"/>
			<fs ref="p08-config"/>
			<fs ref="p08-default"/>
		</service>
		<service autostart="1" domain="p09" name="p09" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p09.oldmappi" name="initd.p09.oldmappi"/>
			<ip ref="128.214.20.215"/>
			<fs ref="p09-config"/>
			<fs ref="p09-default"/>
		</service>
		<service autostart="1" domain="p10" name="p10" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p10.oldmappi" name="initd.p10.oldmappi"/>
			<ip ref="128.214.20.216"/>
			<fs ref="p10-config"/>
			<fs ref="p10-default"/>
		</service>
		<service autostart="1" domain="p11" name="p11" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p11.oldmappi" name="initd.p11.oldmappi"/>
			<ip ref="128.214.20.217"/>
			<fs ref="p11-config"/>
			<fs ref="p11-default"/>
		</service>
		<service autostart="1" domain="p12" name="p12" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.p12.oldmappi" name="initd.p12.oldmappi"/>
			<ip ref="128.214.20.218"/>
			<fs ref="p12-config"/>
			<fs ref="p12-default"/>
		</service>
		<service autostart="1" domain="i01" name="i01" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i01.master" name="initd.i01.master"/>
			<ip ref="128.214.20.139"/>
			<fs ref="i01-config"/>
			<fs ref="aaal"/>
		</service>
		<service autostart="1" domain="i02" name="i02" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i02.master" name="initd.i02.master"/>
			<ip ref="128.214.20.140"/>
			<fs ref="i02-config"/>
			<fs ref="amav"/>
		</service>
		<service autostart="1" domain="i03" name="i03" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i03.master" name="initd.i03.master"/>
			<ip ref="128.214.20.141"/>
			<fs ref="i03-config"/>
			<fs ref="awei"/>
		</service>
		<service autostart="1" domain="i04" name="i04" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i04.master" name="initd.i04.master"/>
			<ip ref="128.214.20.142"/>
			<fs ref="i04-config"/>
			<fs ref="ejga"/>
		</service>
		<service autostart="1" domain="i05" name="i05" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i05.master" name="initd.i05.master"/>
			<ip ref="128.214.20.143"/>
			<fs ref="i05-config"/>
			<fs ref="gbhk"/>
		</service>
		<service autostart="1" domain="i06" name="i06" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i06.master" name="initd.i06.master"/>
			<ip ref="128.214.20.144"/>
			<fs ref="i06-config"/>
			<fs ref="hlhz"/>
		</service>
		<service autostart="1" domain="i07" name="i07" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i07.master" name="initd.i07.master"/>
			<ip ref="128.214.20.145"/>
			<fs ref="i07-config"/>
			<fs ref="iaji"/>
		</service>
		<service autostart="1" domain="i08" name="i08" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i08.master" name="initd.i08.master"/>
			<ip ref="128.214.20.146"/>
			<fs ref="i08-config"/>
			<fs ref="jjjr"/>
		</service>
		<service autostart="1" domain="i09" name="i09" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i09.master" name="initd.i09.master"/>
			<ip ref="128.214.20.147"/>
			<fs ref="i09-config"/>
			<fs ref="jskd"/>
		</service>
		<service autostart="1" domain="i10" name="i10" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i10.master" name="initd.i10.master"/>
			<ip ref="128.214.20.148"/>
			<fs ref="i10-config"/>
			<fs ref="keko"/>
		</service>
		<service autostart="1" domain="i11" name="i11" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i11.master" name="initd.i11.master"/>
			<ip ref="128.214.20.149"/>
			<fs ref="i11-config"/>
			<fs ref="kplh"/>
		</service>
		<service autostart="1" domain="i12" name="i12" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i12.master" name="initd.i12.master"/>
			<ip ref="128.214.20.150"/>
			<fs ref="i12-config"/>
			<fs ref="lilz"/>
		</service>
		<service autostart="1" domain="i13" name="i13" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i13.master" name="initd.i13.master"/>
			<ip ref="128.214.20.165"/>
			<fs ref="i13-config"/>
			<fs ref="mami"/>
		</service>
		<service autostart="1" domain="i14" name="i14" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i14.master" name="initd.i14.master"/>
			<ip ref="128.214.20.166"/>
			<fs ref="i14-config"/>
			<fs ref="mjmp"/>
		</service>
		<service autostart="1" domain="i15" name="i15" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i15.master" name="initd.i15.master"/>
			<ip ref="128.214.20.167"/>
			<fs ref="i15-config"/>
			<fs ref="mqnj"/>
		</service>
		<service autostart="1" domain="i16" name="i16" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i16.master" name="initd.i16.master"/>
			<ip ref="128.214.20.168"/>
			<fs ref="i16-config"/>
			<fs ref="nkpd"/>
		</service>
		<service autostart="1" domain="i17" name="i17" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i17.master" name="initd.i17.master"/>
			<ip ref="128.214.20.169"/>
			<fs ref="i17-config"/>
			<fs ref="peps"/>
		</service>
		<service autostart="1" domain="i18" name="i18" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i18.master" name="initd.i18.master"/>
			<ip ref="128.214.20.170"/>
			<fs ref="i18-config"/>
			<fs ref="ptro"/>
		</service>
		<service autostart="1" domain="i19" name="i19" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i19.master" name="initd.i19.master"/>
			<ip ref="128.214.20.171"/>
			<fs ref="i19-config"/>
			<fs ref="rpsi"/>
		</service>
		<service autostart="1" domain="i20" name="i20" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i20.master" name="initd.i20.master"/>
			<ip ref="128.214.20.172"/>
			<fs ref="i20-config"/>
			<fs ref="sjss"/>
		</service>
		<service autostart="1" domain="i21" name="i21" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i21.master" name="initd.i21.master"/>
			<ip ref="128.214.20.173"/>
			<fs ref="i21-config"/>
			<fs ref="stth"/>
		</service>
		<service autostart="1" domain="i22" name="i22" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i22.master" name="initd.i22.master"/>
			<ip ref="128.214.20.174"/>
			<fs ref="i22-config"/>
			<fs ref="tits"/>
		</service>
		<service autostart="1" domain="i23" name="i23" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i23.master" name="initd.i23.master"/>
			<ip ref="128.214.20.175"/>
			<fs ref="i23-config"/>
			<fs ref="ttvi"/>
		</service>
		<service autostart="1" domain="i24" name="i24" recovery="restart">
			<script file="/etc/init.d/cyrus-imapd.i24.master" name="initd.i24.master"/>
			<ip ref="128.214.20.176"/>
			<fs ref="i24-config"/>
			<fs ref="vjzz"/>
		</service>
	</rm>
</cluster>
--clip--

> 
> Can you run:
> 
> rg_test rules
> 
> ...and attach the output?  Maybe there's something locale-based that's
> causing it to misbehave (but I can't imagine what).

Well, it looks good enough:

--clip--
Running in rules mode.
Entity: line 2: parser error : Premature end of data in tag err line 1

^
Loaded 18 resource rules
Resource Rules for "SAPInstance"
Agent: SAPInstance
Attributes:
  InstanceName [ primary unique required default="" ]
  DIR_EXECUTABLE [ unique default="" ]
  DIR_PROFILE [ unique default="" ]
  START_PROFILE [ unique default="" ]
Actions:
  start
    Timeout (hint): 180 seconds
  stop
    Timeout (hint): 240 seconds
  status
    Timeout (hint): 60 seconds
  monitor
    Timeout (hint): 60 seconds
    Check Interval: 120 seconds
  validate-all
    Timeout (hint): 5 seconds
  meta-data
    Timeout (hint): 5 seconds
  methods
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  - None -

Resource Rules for "lvm"
Max instances: 1
Agent: lvm.sh
Attributes:
  name [ primary ]
  vg_name [ required ]
  lv_name [ required ]
  nfslock [ inherit ]
Actions:
  start
    Timeout (hint): 5 seconds
  stop
    Timeout (hint): 5 seconds
  status
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  monitor
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 30 seconds
Explicitly defined child resource types:
  - None -

Resource Rules for "ip"
Max instances: 1
Agent: ip.sh
Attributes:
  address [ primary unique ]
  family
  monitor_link [ default="1" ]
  nfslock [ inherit ]
Actions:
  start
    Timeout (hint): 20 seconds
  stop
    Timeout (hint): 20 seconds
  status
    Timeout (hint): 10 seconds
    Check Interval: 20 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 20 seconds
  status
    Timeout (hint): 20 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 20 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 60 seconds
  meta-data
    Timeout (hint): 20 seconds
  verify-all
    Timeout (hint): 20 seconds
Explicitly defined child resource types:
  nfsclient (forbidden)
  nfsexport (forbidden)

Resource Rules for "vm"
Max instances: 1
Agent: vm.sh
Attributes:
  name [ primary ]
  domain
  autostart
  recovery
  memory
  bootloader
  path
  rootdisk_physical [ unique ]
  rootdisk_virtual
  swapdisk_physical [ unique ]
  swapdisk_virtual
  vif
Actions:
  start
    Timeout (hint): 20 seconds
  stop
    Timeout (hint): 120 seconds
  status
    Timeout (hint): 10 seconds
    Check Interval: 30 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 30 seconds
  reconfig
    Timeout (hint): 10 seconds
  migrate
    Timeout (hint): 600 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  - None -

Resource Rules for "clusterfs"
Agent: clusterfs.sh
Attributes:
  name [ primary ]
  mountpoint [ unique required ]
  device [ unique required ]
  fstype
  force_unmount
  options
  fsid
  nfslock [ inherit ]
Actions:
  start
    Timeout (hint): 900 seconds
  stop
    Timeout (hint): 30 seconds
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 20 seconds
    Check Interval: 600 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 20 seconds
    Check Interval: 600 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  fs [ startlevel = 1 stoplevel = 3 ] 
  clusterfs [ startlevel = 1 stoplevel = 3 ] 
  nfsexport [ startlevel = 3 stoplevel = 1 ] 

Resource Rules for "script"
Agent: script.sh
Attributes:
  name [ primary unique ]
  file [ unique required ]
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Check Interval: 30 seconds
  monitor
    Check Interval: 30 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "openldap"
Agent: openldap.sh
Attributes:
  name [ primary ]
  config_file [ default="/etc/openldap/slapd.conf" ]
  url_list [ default="ldap:///" ]
  slapd_options
  shutdown_wait
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "service"
Max instances: 1
Agent: service.sh
Attributes:
  name [ primary unique required ]
  domain
  autostart
  hardrecovery
  exclusive
  nfslock
  recovery
  depend
Actions:
  start
    Timeout (hint): 5 seconds
  stop
    Timeout (hint): 5 seconds
  status
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  monitor
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  recover
    Timeout (hint): 5 seconds
  reload
    Timeout (hint): 5 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  lvm [ startlevel = 1 stoplevel = 9 ] 
  fs [ startlevel = 2 stoplevel = 8 ] 
  clusterfs [ startlevel = 3 stoplevel = 7 ] 
  netfs [ startlevel = 4 stoplevel = 6 ] 
  nfsexport [ startlevel = 5 stoplevel = 5 ] 
  nfsclient [ startlevel = 6 stoplevel = 4 ] 
  ip [ startlevel = 7 stoplevel = 2 ] 
  smb [ startlevel = 8 stoplevel = 3 ] 
  script [ startlevel = 9 stoplevel = 1 ] 

Resource Rules for "netfs"
Agent: netfs.sh
Attributes:
  name [ primary ]
  mountpoint [ unique required ]
  host [ required ]
  export [ required ]
  fstype
  force_unmount
  options
Actions:
  start
    Timeout (hint): 900 seconds
  stop
    Timeout (hint): 30 seconds
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 20 seconds
    Check Interval: 600 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 20 seconds
    Check Interval: 600 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  nfsexport (forbidden)
  nfsclient (forbidden)

Resource Rules for "nfsexport"
Agent: nfsexport.sh
Attributes:
  name [ primary ]
  device [ inherit ]
  path [ inherit ]
  fsid [ inherit ]
Actions:
  start
    Timeout (hint): 5 seconds
  stop
    Timeout (hint): 5 seconds
  recover
    Timeout (hint): 5 seconds
  status
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  monitor
    Timeout (hint): 5 seconds
    Check Interval: 3600 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 30 seconds
Explicitly defined child resource types:
  nfsexport (forbidden)
  nfsclient

Resource Rules for "tomcat-5"
Agent: tomcat-5.sh
Attributes:
  name [ primary ]
  config_file [ default="/etc/tomcat5/tomcat5.conf" ]
  tomcat_user [ default="tomcat" ]
  catalina_options
  catalina_base [ default="/usr/share/tomcat5" ]
  shutdown_wait [ default="30" ]
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "samba"
Agent: samba.sh
Attributes:
  name [ primary unique ]
  config_file [ default="/etc/samba/smb.conf" ]
  smbd_options
  nmbd_options
  shutdown_wait
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Check Interval: 30 seconds
  monitor
    Check Interval: 30 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "mysql"
Agent: mysql.sh
Attributes:
  name [ primary ]
  config_file [ default="/etc/my.cnf" ]
  listen_address
  mysqld_options
  shutdown_wait
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "postgres-8"
Agent: postgres-8.sh
Attributes:
  name [ primary ]
  config_file [ default="/var/lib/pgsql/data/postgresql.conf" ]
  postmaster_user [ default="postgres" ]
  postmaster_options
  shutdown_wait
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "nfsclient"
Agent: nfsclient.sh
Attributes:
  name [ primary unique ]
  target [ required ]
  path [ inherit ]
  fsid [ inherit ]
  options
  allow_recover
Actions:
  start
    Timeout (hint): 90 seconds
  stop
    Timeout (hint): 5 seconds
  recover
    Timeout (hint): 90 seconds
  status
    Timeout (hint): 5 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 5 seconds
    Check Interval: 60 seconds
  meta-data
    Timeout (hint): 5 seconds
  verify-all
    Timeout (hint): 30 seconds
Explicitly defined child resource types:
  - None -

Resource Rules for "SAPDatabase"
Agent: SAPDatabase
Attributes:
  SID [ primary unique required default="" ]
  DIR_EXECUTABLE [ unique default="" ]
  DBTYPE [ unique required default="" ]
  NETSERVICENAME [ unique default="" ]
  DBJ2EE_ONLY [ unique default="false" ]
  DIR_BOOTSTRAP [ unique default="" ]
  DIR_SECSTORE [ unique default="" ]
Actions:
  start
    Timeout (hint): 1800 seconds
  stop
    Timeout (hint): 1800 seconds
  status
    Timeout (hint): 60 seconds
  monitor
    Timeout (hint): 60 seconds
    Check Interval: 120 seconds
  validate-all
    Timeout (hint): 5 seconds
  meta-data
    Timeout (hint): 5 seconds
  methods
    Timeout (hint): 5 seconds
Explicitly defined child resource types:
  - None -

Resource Rules for "apache"
Agent: apache.sh
Attributes:
  name [ primary ]
  server_root [ default="/etc/httpd" ]
  config_file [ default="conf/httpd.conf" ]
  httpd_options
  shutdown_wait
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  monitor
    Timeout (hint): 10 seconds
    Check Interval: 60 seconds
  status
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  monitor
    Timeout (hint): 30 seconds
    OCF Check Depth (status/monitor): 10 seconds
    Check Interval: 300 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

Resource Rules for "smb"
Agent: smb.sh
Attributes:
  name [ primary unique ]
  workgroup [ default="LINUXCLUSTER" ]
  service_name [ inherit ]
Actions:
  start
  stop
  status
    Check Interval: 30 seconds
  monitor
    Check Interval: 30 seconds
  meta-data
  verify-all
Explicitly defined child resource types:
  - None -

--clip--

> The only way I can imagine this happening is:
> (a) for some reason, the intervals for fs.sh checks are being set to 0
> (rg_test should see this) *and* 
> 
> (b) for some other reason, the status check queueing is occurring
> constantly instead of every 10 seconds like it should
> 
> I can prevent (a) from happening by not letting it be less than 1; but
> that doesn't explain why you're seeing what you are seeing.
> 
> I'll be updating packages again probably later today or tomorrow, but
> I don't think any of the fixed bugs are related to the fs.sh problem
> you're experiencing at this point.



-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From chris at cmiware.com  Tue Jul 31 15:00:39 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 31 Jul 2007 10:00:39 -0500
Subject: [Linux-cluster] fence_apc failure
In-Reply-To: <46AF48F4.8090701@redhat.com>
References: <46AE6DC3.1060807@cmiware.com> <46AE7383.6090706@cmiware.com>
	<46AF48F4.8090701@redhat.com>
Message-ID: <46AF4E97.9090201@cmiware.com>

The last error I reported where the node was actually powered off, but 
the cluster thought it failed was with the CVS python script. 

 > /sbin/fence_apc -V
fence_apc New APC Agent - test release  September 21, 2006

We noticed the asterisk as well and thought that might be problematic.  
A co-worker has hacked on the python script some and reports that it now 
functions properly from the command line.  I have yet to begin testing 
with the cluster though.

The Perl script (renamed to test the python script) reports:

 > /sbin/fence_apc.pl.bak -V
fence_apc.pl.bak 2.0.64 (built Mon Jun 25 14:34:20 EDT 2007)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.

I'd be happy to test for you, since that is what I'll be doing anyway.

Thanks,
Chris

James Parsons wrote:
> Chris,
>
> If you run /sbin/fence_apc -V, and the response is "fence_apc 
> 1.32.45", then you are using the latest
> apc agent that has been updated to handle the 3.x apc firmware.
>
> If you are running the latest agent, then you are obviously 
> encountering a problem. It is possible that it has something to do 
> with the asterisk that is appearing after an outlet group. Please let 
> me know if you are running 1.32.45...this should be able to be fixed 
> today, if you can help with some testing.
>
> Some background on the fence_apc agent for those who care:
> For years, the apc agent was written in perl. As the apc firmware 
> evolved from release to release, the regexp match patterns for screen 
> scraping the telnet session grew uglier and uglier. In addition, the 
> perl agent did not support outlet naming or grouping, and did not 
> handle the larger switches that took 2 or more screens to list out the 
> available outlets. To add these features and to make the agent easier 
> to maintain, it was decided to rewrite the agent in python. This was 
> done while apc firmware was still in the 2.x series.
>
> After some errors with MasterSwitchPlus switches were fixed, the agent 
> worked well in the field...until version 3.x firmware was released. 
> This firmware release changed ALOT of things, including screen order.
>
> The current agent version worked well on my older apc switches as well 
> as the ones with newer firmware, so I released it into the beta and 
> plan on releasing it as an async errata release for RHEL4.5 Cluster 
> Suite.
>
> I will try and reproduce this issue this morning and have something 
> for you to test today.
>
> By the way, there is also a fence_apc_snmp agent. It works great, but 
> we have not switched to using it exclusively yet, because some admins 
> don't like having snmp packages on their systems - but with the pain 
> that trying to maintain the telnet version of this agent is causing, 
> it is making me lean more and more towards including just one apc 
> solution - snmp. :-/
>
> Thanks for your patience,
>
> -Jim
>
> Chris Harms wrote:
>
>> This appears to be a "no."
>> fence_node[26827]: agent "fence_apc" reports: Power Off 
>> unsuccessfulStatus check successful. Port 4 is ON
>>
>> However the node was powered off so something worked.  Also, it 
>> appears to have sent an Off command instead of Reboot.
>>
>>
>> Chris Harms wrote:
>>
>>> We now have some APC 7931 units at our disposal, however the 
>>> fence_apc perl script fails with "unrecognized menu response."  This 
>>> appears to be from new firmware on the APC units.  There is a python 
>>> script in CVS that looks like it may operate correctly with the new 
>>> firmware menus, is this correct?  Also, is it a drop-in replacement 
>>> for the perl script, i.e. will saving it as /sbin/fence_apc work 
>>> with RHCS 5?
>>>
>>> Thanks,
>>> Chris
>>>
>>> -- 
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From maciej.bogucki at artegence.com  Tue Jul 31 15:08:36 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 31 Jul 2007 17:08:36 +0200
Subject: [Linux-cluster] GFS i/o Request size
In-Reply-To: <OFCE3CCC49.E35DCEFE-ON85257326.006E1D6C-85257326.006E2A29@csc.com>
References: <OFCE3CCC49.E35DCEFE-ON85257326.006E1D6C-85257326.006E2A29@csc.com>
Message-ID: <46AF5074.5050805@artegence.com>

Mahmoud Hanafi napisa?(a):
> 
> 
> It seem that GFS likes to do all I/O in 512KB sizes. Is there a way to
> increase this? I like to do I/O sizes of 1024 or 2048KB.
> 
Hello,

First of all linux kernel breaks IO into pages, and the default page
size is 4K. Second You have to set blocksize for filesystem and default
for GFS is 4096(65536 maximum). Third You have segment size for You SAN
storage - for IBM DS4700 You can set from 8KB to 512KB, and here for
every I/O storage read/write this size od data. If it is clear for You,
which value You want to tune?

Best Regards
Maciej Bogucki




From sebastian.walter at fu-berlin.de  Tue Jul 31 15:26:30 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 31 Jul 2007 17:26:30 +0200
Subject: [Linux-cluster] Question about GFS Resource/Service
In-Reply-To: <46AF5074.5050805@artegence.com>
References: <OFCE3CCC49.E35DCEFE-ON85257326.006E1D6C-85257326.006E2A29@csc.com>
	<46AF5074.5050805@artegence.com>
Message-ID: <46AF54A6.3040302@fu-berlin.de>

Dear list,

what is the purpose of the gfs service/resource? Is it about to mount
the gfs volume on all the nodes or just the owner node? In my
configuration, the volume is only mounted on the owner. Maybe I'm
missing something... Thanks for any answer!

Regards,
Sebastian



From jleafey at utmem.edu  Tue Jul 31 15:48:44 2007
From: jleafey at utmem.edu (Jay Leafey)
Date: Tue, 31 Jul 2007 10:48:44 -0500
Subject: [Linux-cluster] Odd cluster problems
Message-ID: <46AF59DC.9000906@utmem.edu>

I've got a 3-node cluster running CentOS 4.5 and I cannot communicate 
with the resource group manager.  When I use the clustat command I get a 
timeout:

> [root at rapier ~]# clustat
> Timed out waiting for a response from Resource Group Manager
> Member Status: Quorate
> 
>   Member Name                              Status
>   ------ ----                              ------
>   rapier.utmem.edu                         Online, Local, rgmanager
>   thorax.utmem.edu                         Offline
>   cyclops.utmem.edu                        Online, rgmanager

I've got rgmanager 1.9.68-1 installed, along with the following 
"relevant" packages:

kernel-2.6.9-55.EL.x86_64
ccs-1.0.10-0.x86_64
cman-1.0.17-0.x86_64
cman-kernel-2.6.9-50.2.x86_64
dlm-1.0.3-1.x86_64
dlm-kernel-2.6.9-46.16.x86_64
fence-1.32.45-1.0.1.x86_64
GFS-6.1.14-0.x86_64
GFS-kernel-2.6.9-72.2.x86_64
gulm-1.0.10-0.x86_64
lvm2-cluster-2.02.21-7.el4.x86_64
magma-1.0.7-1.x86_64
magma-plugins-1.0.12-0.x86_64
rgmanager-1.9.68-1.x86_64
system-config-cluster-1.0.45-1.0.noarch

I checked the archives and saw similar reports, but they all seem to 
reference an older version of rgmanager.

I did some poking around and there is one service (show by cman_tool 
services) shown in a state other than "run", the "usrm::manager" 
service.  Here's the anomalous output:

> [root at rapier ~]# cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 recover 4 -
> [1 2]
> 
> <SNIP>
> 
> User:            "usrm::manager"                    10  10 recover 2 -
> [1 2]
> 

The services handled by rgmanager are all running, but any attempt to 
update the cluster.conf file via ccs_tool update 
"/etc/cluster/cluster.conf" is ineffective.  The file gets updated, but 
the config version shown by "cman_tool status" does not change.

Any thought on how to proceed with troubleshooting this?
-- 
Jay Leafey - University of Tennessee
E-Mail:  jleafey at utmem.edu  Phone:  901-448-6534  FAX:  901-448-8199

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5158 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/f583582b/attachment.bin>

From filipe.miranda at gmail.com  Tue Jul 31 15:52:38 2007
From: filipe.miranda at gmail.com (Filipe Miranda)
Date: Tue, 31 Jul 2007 12:52:38 -0300
Subject: [Linux-cluster] VF: Abort: Invalid header in reply from member #1
Message-ID: <a6d13c780707310852g685bd07cv98942edc46e35c25@mail.gmail.com>

Hello everybody,

I am using RedHatCluster Suite for RHEL3 and I am experiencing the following
errors in the cluster's log file:

Jan 11 11:33:50 node2 clumembd[15717]: <notice> Member cluster1-node1 UP
Jan 11 11:33:51 node2 clumembd[15717]: <info> Membership View #2:0x00000003

Jan 11 11:33:52 node2 cluquorumd[15704]: <notice> Quorum Formed; Starting
Service Manager
Jan 11 11:33:52 node2 cluquorumd[16055]: <err> VF: Abort: Invalid header in
reply from member #1
Jan 11 11:33:52 node2 clusvcmgrd[16057]: <info> Quorum Event: View #2
0x00000003
Jan 11 11:33:52 node2 clusvcmgrd[16057]: <info> State change: Local UP
Jan 11 11:33:52 node2 clusvcmgrd[16057]: <info> Initializing services
Jan 11 11:36:48 node2 clusvcmgrd[16057]: <warning> Starvation on Lock #0!
Jan 11 11:39:44 node2 clusvcmgrd[16057]: <warning> Starvation on Lock #0!

Its a 2 node cluster, and I already reinitialized the quorum disk but the
problem persists.
Whenever I reboot the machines the rgmanager wont start any cluster service.

This is what is shown inthe node1 of the cluster at the same time the node2
logs the messages above:

Jan 11 11:28:40 patmos clumembd[15899]: <notice> Member cluster1-node1 UP
Jan 11 11:28:41 patmos clumembd[15899]: <info> Membership View #1:0x00000001

Jan 11 11:33:50 patmos clumembd[15899]: <notice> Member cluster2-node2 UP
Jan 11 11:33:50 patmos clumembd[15899]: <info> Membership View #2:0x00000003


Is there any suggestions on whyis this happening?
Any help is appreciated

Thank you,

Regards,
-- 
---

Filipe Miranda
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/27e5b292/attachment.htm>

From dist-list at LEXUM.UMontreal.CA  Tue Jul 31 16:03:02 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Tue, 31 Jul 2007 12:03:02 -0400
Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT
Message-ID: <46AF5D36.9090304@lexum.umontreal.ca>

Hello,
I'm trying to create a LVS redundancy server with redundacy server.

The problem is LVS1 moved the virtual address to the LVS2 server. But
the real server, using arptables still have the LVS1 ip address in the
arptables_jf rule.

is it possible to use redundancy server with direct routing ?

tx!



From benjamin.jakubowski at gmail.com  Tue Jul 31 16:37:21 2007
From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski)
Date: Tue, 31 Jul 2007 18:37:21 +0200
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
Message-ID: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>

Hi

Do you know if possible to install RedHat Cluster Suite on RedHat AS 4 U 4 ?

Thanks a lot

Benjamin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/7eb3af2c/attachment.htm>

From breeves at redhat.com  Tue Jul 31 16:38:59 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Tue, 31 Jul 2007 17:38:59 +0100
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
Message-ID: <46AF65A3.1010806@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Benjamin Jakubowski wrote:
> Hi
> 
> Do you know if possible to install RedHat Cluster Suite on RedHat AS 4 U 4 ?
> 
> Thanks a lot
> 
> Benjamin
> 

Certainly is - you'll find it in the Red Hat Cluster Suite channel on
RHN, although if you're planning a new deployment, you may prefer to go
with the more up-to-date RHEL4.5.

Kind regards,
Bryn.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGr2Wj6YSQoMYUY94RApS1AJ9rdPXuZQFKDvdwl4kqV2+cLvj4vQCgltuV
ADL5+0bxYsX/xLVT6CJKDlc=
=8TRg
-----END PGP SIGNATURE-----



From benjamin.jakubowski at gmail.com  Tue Jul 31 16:56:01 2007
From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski)
Date: Tue, 31 Jul 2007 18:56:01 +0200
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <46AF65A3.1010806@redhat.com>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
	<46AF65A3.1010806@redhat.com>
Message-ID: <6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>

The probleme is :
i need to preserve the kernel version 2.6.9-42.ELsmp, to preserve a SAN
compatibility and in RHN, there isn't have a cluster kernel module ?
do u have any idea ?

Thanks a lot
Benjamin

2007/7/31, Bryn M. Reeves <breeves at redhat.com>:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Benjamin Jakubowski wrote:
> > Hi
> >
> > Do you know if possible to install RedHat Cluster Suite on RedHat AS 4 U
> 4 ?
> >
> > Thanks a lot
> >
> > Benjamin
> >
>
> Certainly is - you'll find it in the Red Hat Cluster Suite channel on
> RHN, although if you're planning a new deployment, you may prefer to go
> with the more up-to-date RHEL4.5.
>
> Kind regards,
> Bryn.
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
>
> iD8DBQFGr2Wj6YSQoMYUY94RApS1AJ9rdPXuZQFKDvdwl4kqV2+cLvj4vQCgltuV
> ADL5+0bxYsX/xLVT6CJKDlc=
> =8TRg
> -----END PGP SIGNATURE-----
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
@+
Benj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/b1f622bb/attachment.htm>

From janne.peltonen at helsinki.fi  Tue Jul 31 16:11:12 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Tue, 31 Jul 2007 19:11:12 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731121438.GA21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
Message-ID: <20070731161112.GF21896@helsinki.fi>

On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > On Fri, Jul 06, 2007 at 09:36:59PM +0300, Janne Peltonen wrote:
> > > On Fri, Jul 06, 2007 at 02:31:51PM -0400, Lon Hohberger wrote:
> > > > > > I forgot what this was... could you just mail me your original email
> > > > > > off-list?
> > > > > Nevermind, I found it
> > > > > https://www.redhat.com/archives/linux-cluster/2007-June/msg00115.html
> > > > I think it's actually the same problem as the 'status checks' being
> > > > wrong in 2.0.24; just a different symptom.

Now, as you mentioned it - I see no more status check lines in my
/var/log/messages, but status checking seems to work (I just turned off
a service manually, and clurgmgrd noticed it and restarted the service).
This is probably by design, isn't it?


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From breeves at redhat.com  Tue Jul 31 17:02:28 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Tue, 31 Jul 2007 18:02:28 +0100
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>	<46AF65A3.1010806@redhat.com>
	<6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>
Message-ID: <46AF6B24.9000208@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Benjamin Jakubowski wrote:
> The probleme is :
> i need to preserve the kernel version 2.6.9-42.ELsmp, to preserve a SAN
> compatibility and in RHN, there isn't have a cluster kernel module ?
> do u have any idea ?
> 
> Thanks a lot
> Benjamin

OK, if you need to match with a specific kernel release you'll need to
use the Cluster Suite packages that were released at the same time.

For U4 (2.6.9-42.EL), I think those would be:

cman-kernel-2.6.9-45.2
cman-kernel-smp-2.6.9-45.2
cman-kernel-hugemem-2.6.9-45.2
dlm-kernel-hugemem-2.6.9-42.10
dlm-kernel-2.6.9-42.10
dlm-kernel-smp-2.6.9-42.10

If you also need the packages for GFS, those are:

GFS-kernel-2.6.9-58.0
GFS-kernel-smp-2.6.9-58.0
GFS-kernel-hugemem-2.6.9-58.0
gnbd-kernel-hugemem-2.6.9-9.41
gnbd-kernel-2.6.9-9.41
gnbd-kernel-smp-2.6.9-9.41

Those packages are still available on RHN - just go to the web
interface, hit search by package & they should appear there.

Be aware though that there were some important kernel related bugfixes
applied after U4 - particularly for systems using multipath SAN storage.

Kind regards,
Bryn.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk+YE
YJFtx2iQm4zdYnKXQ/QmRHY=
=s4BM
-----END PGP SIGNATURE-----



From janne.peltonen at helsinki.fi  Tue Jul 31 17:11:23 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Tue, 31 Jul 2007 20:11:23 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731145441.GE21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
Message-ID: <20070731171123.GH21896@helsinki.fi>

On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote:
> On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote:
> > On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > > > 
> > > > http://people.redhat.com/lhh/rhel5-test
> > > > 
> > > > You'll need at least the updated cman package.  The -2.1lhh build of
> > > > rgmanager is the one I just built today; the others are a bit older.
> > > 
> > > Well, I installed the new versions of the cman and rgmanager packages I
> > > found there, but to no avail: I still get 1500 invocations of fs.sh per
> > > second.
> > 
> > I put a log message in fs.sh:
> > 
> > Jul 31 09:27:29 bart clurgmgrd: [4395]: <err> /usr/share/cluster/fs.sh
> > TEST 
> > 
> > It comes up once every several (10-20) seconds like it's supposed to. 
> 
> I did the same, with the same results. It seems to me that the clurgmgrd
> process isn't calling the complete script any more times than it's
> supposed to.  What I'm seeing are the execs of fs.sh, that is, it
> includes each () and `` and so on. Each fs.sh invocation seems to create
> quite an amount of subshells.
> 
> I'm sorry for having misled you. And this all means, there isn't
> probably much reason to read the cluster.conf and rg_test rules output -
> I'll attach them anyway.

After running the new rgmanager packages for abt four hours without any
of the load fluctuation I'd experienced before (with a more-or-less
four-hour interval, system load first increases slowly until it reaches
a high level - dependent on overall system load - and then swiftly
decreases to near zero, to start increasing again. This fluctuation
peaks at about 5.0 in a system with no users at all, but many services.
If there are many users and the user peak coincides with the base peak,
the system experiences a shortish load peak of abt 100.0, after which it
recovers and the basic load fluctuation becomes visible again). Then the
load averages started increasing again, to something 10.0ish, so -
frustrated - I edited /usr/share/cluster/fs.sh and put an exit 0 to the
switch-case "status|monitor" on $1. Well. Load averages promptly fell
back to under 0.5, disk usage% fell by 30 %-units, and overall system
responsiveness increased considerably.

So I'll be running my cluster without fs status checks for now. I hope
someone'll work out what's wrong with fs.sh soon... ;)


--Janne



From mgrac at redhat.com  Tue Jul 31 18:24:09 2007
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Tue, 31 Jul 2007 20:24:09 +0200
Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT
In-Reply-To: <46AF5D36.9090304@lexum.umontreal.ca>
References: <46AF5D36.9090304@lexum.umontreal.ca>
Message-ID: <46AF7E49.6050201@redhat.com>

Hi,

FM wrote:
> Hello,
> I'm trying to create a LVS redundancy server with redundacy server.
>
> The problem is LVS1 moved the virtual address to the LVS2 server. But
> the real server, using arptables still have the LVS1 ip address in the
> arptables_jf rule.
>
> is it possible to use redundancy server with direct routing ?
>   
Yes, it is possible. Direct routing is described in manual:

http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Virtual_Server_Administration/index.html


marx,

-- 
Marek Grac
Red Hat Czech s.r.o.



From mhanafi at csc.com  Tue Jul 31 19:58:17 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Tue, 31 Jul 2007 15:58:17 -0400
Subject: [Linux-cluster] GFS Scaling
Message-ID: <OFB617E5FF.F6750305-ON85257329.006CC611-85257329.006DB9DE@csc.com>

This may have been ask before.....

How large can a GFS cluster scale? Can I scale up to 128 or 256 nodes? I 
have read some thing about a 16 node limit? Whats better lock_dlm or 
lock_glum?

Regards,
Mahmoud
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/16d9d899/attachment.htm>

From mhanafi at csc.com  Tue Jul 31 20:07:24 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Tue, 31 Jul 2007 16:07:24 -0400
Subject: [Linux-cluster] GFS i/o Request size
In-Reply-To: <46AF5074.5050805@artegence.com>
Message-ID: <OF7E54B806.9937D037-ON85257329.006DE130-85257329.006E8F3F@csc.com>

This seem to be an issue with the el5 . On el4 I can set the 
/sys/block/sd*/queue/max_sector_kb for the block device up 4096KB. Using 
xdd writing directly to the device I can achive I/O request sizes of 
4096KB. But doing the same thing on el5 the max I can get is 512kb. I know 
there has been changes to the qla2400 drivers from el4 to el5. I am not 
sure if this limit is at the driver level or not. One other thing to note 
on el5 the default value for max_sector_kb is always set to a max of 
512KB. But on el4 it gets set to 4096KB which is the block size of the 
luns.

-Mahmoud


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




Maciej Bogucki <maciej.bogucki at artegence.com> 
07/31/2007 11:08 AM

To
linux clustering <linux-cluster at redhat.com>, Mahmoud Hanafi/DEF/CSC at CSC
cc

Subject
Re: [Linux-cluster] GFS i/o Request size






Mahmoud Hanafi napisa?(a):
> 
> 
> It seem that GFS likes to do all I/O in 512KB sizes. Is there a way to
> increase this? I like to do I/O sizes of 1024 or 2048KB.
> 
Hello,

First of all linux kernel breaks IO into pages, and the default page
size is 4K. Second You have to set blocksize for filesystem and default
for GFS is 4096(65536 maximum). Third You have segment size for You SAN
storage - for IBM DS4700 You can set from 8KB to 512KB, and here for
every I/O storage read/write this size od data. If it is clear for You,
which value You want to tune?

Best Regards
Maciej Bogucki



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070731/66cf1038/attachment.htm>

From dist-list at LEXUM.UMontreal.CA  Tue Jul 31 20:28:31 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Tue, 31 Jul 2007 16:28:31 -0400
Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT
In-Reply-To: <46AF7E49.6050201@redhat.com>
References: <46AF5D36.9090304@lexum.umontreal.ca> <46AF7E49.6050201@redhat.com>
Message-ID: <46AF9B6F.7080300@lexum.umontreal.ca>

Tx for the reply,
I re read the doc and my  question remains :-)
ex :
from the RH documentation :
Create the ARP table entries for each virtual IP address on each real
server (the real_ip is the IP the director uses to communicate with the
real server; often this is the IP bound to eth0):
arptables -A IN -d <virtual_ip> -j DROP
arptables -A OUT -d <virtual_ip> -j mangle --mangle-ip-s <real_ip>


If I create a redundancy server, and if the master server goes down, the
backup server will create all the <virtual_ip> but not the <real_ip> so
all the real servers still have the arptables setting to modify the
source of the IP packet to look likes the master LVS server that is down
now.


Marek 'marx' Grac wrote:
> Hi,
>
> FM wrote:
>> Hello,
>> I'm trying to create a LVS redundancy server with redundacy server.
>>
>> The problem is LVS1 moved the virtual address to the LVS2 server. But
>> the real server, using arptables still have the LVS1 ip address in the
>> arptables_jf rule.
>>
>> is it possible to use redundancy server with direct routing ?
>>   
> Yes, it is possible. Direct routing is described in manual:
>
> http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Virtual_Server_Administration/index.html
>
>
>
> marx,
>