From janne.peltonen at helsinki.fi Sun Jul 1 11:17:48 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Sun, 1 Jul 2007 14:17:48 +0300
Subject: [Linux-cluster] Rgmanager fails to restart
Message-ID: <20070701111748.GA9103@helsinki.fi>

Hi!

Sometimes, when I have cleanly shut down rgmanager on one node, and the
services have nicely migrated to other nodes, trying to start rgmanager
fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
device". clurgmgrd concludes that locks are not working and exits.
(See strace output attached.)

--cut--
[jmmpelto at pcn1 ~]$ sudo service rgmanager start
Starting Cluster Service Manager:  [ OK ]
[jmmpelto at pcn1 ~]$ sudo service rgmanager status
clurgmgrd dead but pid file exists
--cut--

Trying to stop cman fails:

--clip--
[jmmpelto at pcn1 ~]$ sudo service cman restart
Stopping cluster:
  Stopping fencing... done
  Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
  [FAILED]
Starting cluster:
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... done
  Starting daemons... done
  Starting fencing... done
  [ OK ]
--clip--

And indeed, the rgmanager that isn't there is there:

--clip--
[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 2 3 4 100]
dlm              1     clvmd      00010002 none
[1 2 3 4 100]
dlm              1     rgmanager  00020002 none
[1 2 3 4]
--clip--

If I say 'cman_tool leave force', it succeeds. But if I then try
starting the cluster:

--cut--
[jmmpelto at pcn1 ~]$ sudo service cman start
Starting cluster:
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... done
  Starting daemons... done
  Starting fencing... failed
  [FAILED]
--cut--

Log (oops, I forgot to shut down clvmd there... it would have gone down
cleanly):

--cut--
Jul  1 14:11:02 pcn1.mappi.helsinki.fi ccsd[4427]: Initial status:: Inquorate
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: found uncontrolled kernel object rgmanager in /sys/kernel/dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: found uncontrolled kernel object clvmd in /sys/kernel/dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi groupd[557]: local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
Jul  1 14:11:28 pcn1.mappi.helsinki.fi fenced[568]: cman_init error 0 111
Jul  1 14:11:28 pcn1.mappi.helsinki.fi dlm_controld[576]: cman_init error 0 111
Jul  1 14:11:28 pcn1.mappi.helsinki.fi gfs_controld[583]: cman_init error 111
--cut--

Thereafter, one of the other nodes fences this one:

--cut--
Jul  1 14:11:50 pcn1.mappi.helsinki.fi init: Switching to runlevel: 0
Jul  1 14:11:50 pcn1.mappi.helsinki.fi ccsd[4427]: Unable to connect to cluster infrastructure after 30 seconds.
Jul  1 14:11:52 pcn1.mappi.helsinki.fi rgmanager: [667]: Cluster Service Manager is stopped.
--cut--

(Now I wonder where that rgmanager log line came from? It isn't from
any clurgmgrd, I checked with ps that there were none running.)

Any ideas?
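In case it helps whoever looks at this: whether a stale lockspace is
still registered can be checked before restarting cman with a few
standard commands (just a rough sketch; the paths are as they appear on
this box):

--cut--
# lockspaces the kernel still knows about
ls /sys/kernel/dlm
# misc devices registered for them
grep dlm /proc/misc
ls -l /dev/misc/
# what groupd still thinks is joined
sudo cman_tool services
--cut--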
(version of relevant packages: lvm2-2.02.16-3.el5 cman-2.0.60-1.el5 rgmanager-2.0.23-1.el5.centos ) --Janne -------------- next part -------------- execve("/usr/sbin/clurgmgrd", ["clurgmgrd"], [/* 17 vars */]) = 0 brk(0) = 0xc4f2000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaab000 uname({sys="Linux", node="pcn1.mappi.helsinki.fi", ...}) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=45165, ...}) = 0 mmap(NULL, 45165, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2aaaaaaac000 close(3) = 0 open("/usr/lib64/libxml2.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \262B_<\0\0\0@\0\0\0\0\0\0\0\260\303\23\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0 at _<\0\0\0\0\0 at _<\0\0\0\24\"\23\0\0\0\0\0\24\"\23\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0\0000\23\0\0\0\0\0\0000s_<\0\0\0\000"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1297136, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaab8000 mmap(0x3c5f400000, 3395256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5f400000 mprotect(0x3c5f533000, 2097152, PROT_NONE) = 0 mmap(0x3c5f733000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x3c5f733000 mmap(0x3c5f73c000, 3768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c5f73c000 close(3) = 0 open("/lib64/libpthread.so.0", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20W\300Y<\0\0\0@\0\0\0\0\0\0\0\330\35\2\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0\'\0&\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\300Y<\0\0\0@\0\300Y<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0@\375\0\0\0\0\0\0@\375\300Y<\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=141208, ...}) = 0 mmap(0x3c59c00000, 2200432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59c00000 mprotect(0x3c59c15000, 2093056, PROT_NONE) = 0 mmap(0x3c59e14000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x3c59e14000 mmap(0x3c59e16000, 13168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c59e16000 close(3) = 0 open("/lib64/libdl.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \16\200Y<\0\0\0@\0\0\0\0\0\0\0\240R\0\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0%\0$\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\200Y<\0\0\0@\0\200Y<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\240\32\0\0\0\0\0\0\240\32\200Y<\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=23520, ...}) = 0 mmap(0x3c59800000, 2109728, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59800000 mprotect(0x3c59802000, 2097152, PROT_NONE) = 0 mmap(0x3c59a02000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3c59a02000 close(3) = 0 open("/usr/lib64/libcman.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\20\200Z<\0\0\0@\0\0\0\0\0\0\0\240L\0\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\200Z<\0\0\0\0\0\200Z<\0\0\0\34A\0\0\0\0\0\0\34A\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0 A\0\0\0\0\0\0 A\240Z<\0\0\0 A\240Z<"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=21472, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaab9000 mmap(0x3c5a800000, 2114456, 
PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a800000 mprotect(0x3c5a805000, 2093056, PROT_NONE) = 0 mmap(0x3c5aa04000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x3c5aa04000 close(3) = 0 open("/usr/lib64/libdlm.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\27\300Z<\0\0\0@\0\0\0\0\0\0\0\20H\0\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\300Z<\0\0\0\0\0\300Z<\0\0\0L;\0\0\0\0\0\0L;\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0P;\0\0\0\0\0\0P;\340Z<\0\0\0P;\340Z<\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=20304, ...}) = 0 mmap(0x3c5ac00000, 2113272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5ac00000 mprotect(0x3c5ac04000, 2093056, PROT_NONE) = 0 mmap(0x3c5ae03000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x3c5ae03000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\331AY<\0\0\0@\0\0\0\0\0\0\0P\211\31\0\0\0\0\0\0\0\0\0@\0008\0\n\0@\0M\0L\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0 at Y<\0\0\0@\0 at Y<\0\0\0000\2\0\0\0\0\0\0000\2\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\240\257\21\0\0\0\0\0\240\257QY<\0\0\0\240\257"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1678480, ...}) = 0 mmap(0x3c59400000, 3461272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c59400000 mprotect(0x3c59544000, 2097152, PROT_NONE) = 0 mmap(0x3c59744000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x144000) = 0x3c59744000 mmap(0x3c59749000, 16536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c59749000 close(3) = 0 open("/usr/lib64/libz.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\36 at Z<\0\0\0@\0\0\0\0\0\0\0(G\1\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\35\0\34\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0 at Z<\0\0\0\0\0 at Z<\0\0\0\3648\1\0\0\0\0\0\3648\1\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0\3708\1\0\0\0\0\0\3708aZ<\0\0\0\3708aZ<\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=85608, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaba000 mmap(0x3c5a400000, 2178600, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a400000 mprotect(0x3c5a414000, 2093056, PROT_NONE) = 0 mmap(0x3c5a613000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13000) = 0x3c5a613000 close(3) = 0 open("/lib64/libm.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200>\0Z<\0\0\0@\0\0\0\0\0\0\0\240X\t\0\0\0\0\0\0\0\0\0@\0008\0\t\0@\0)\0(\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\0Z<\0\0\0@\0\0Z<\0\0\0\370\1\0\0\0\0\0\0\370\1\0\0\0\0\0\0\10\0\0\0\0\0\0\0\3\0\0\0\4\0\0\0\260\304\7\0\0\0\0\0\260\304\7Z<\0\0\0\260"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=615136, ...}) = 0 mmap(0x3c5a000000, 2629848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c5a000000 mprotect(0x3c5a082000, 2093056, PROT_NONE) = 0 mmap(0x3c5a281000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x81000) = 0x3c5a281000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaabb000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaabc000 arch_prctl(ARCH_SET_FS, 0x2aaaaaabba00) = 0 mprotect(0x3c59e14000, 4096, PROT_READ) = 0 mprotect(0x3c59a02000, 4096, PROT_READ) = 0 mprotect(0x3c59744000, 16384, PROT_READ) = 0 
mprotect(0x3c5a281000, 4096, PROT_READ) = 0 mprotect(0x3c59219000, 4096, PROT_READ) = 0 munmap(0x2aaaaaaac000, 45165) = 0 set_tid_address(0x2aaaaaabba90) = 393 set_robust_list(0x2aaaaaabbaa0, 0x18) = 0 rt_sigaction(SIGRTMIN, {0x3c59c05350, [], SA_RESTORER|SA_SIGINFO, 0x3c59c0dd40}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {0x3c59c052a0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3c59c0dd40}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0 geteuid() = 0 getuid() = 0 stat("/var/run/clurgmgrd.pid", {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 brk(0) = 0xc4f2000 brk(0xc513000) = 0xc513000 open("/var/run/clurgmgrd.pid", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 read(3, "369", 4096) = 3 read(3, "", 4096) = 0 close(3) = 0 munmap(0x2aaaaaaac000, 4096) = 0 open("/proc/369", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory) rt_sigprocmask(SIG_BLOCK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV RTMIN RT_1], NULL, 8) = 0 clone(Process 394 attached (waiting for parent) Process 394 resumed (parent 393 ready) child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaabba90) = 394 [pid 393] exit_group(0) = ? Process 393 detached setsid() = 394 chdir("/") = 0 open("/dev/null", O_RDWR) = 3 fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0 dup2(3, 0) = 0 dup2(3, 1) = 1 dup2(3, 2) = 2 close(3) = 0 open("/var/run/clurgmgrd.pid", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 write(3, "394", 3) = 3 close(3) = 0 munmap(0x2aaaaaaac000, 4096) = 0 getpriority(PRIO_PROCESS, 0) = 20 setpriority(PRIO_PROCESS, 0, 4294967295) = 0 getpriority(PRIO_PROCESS, 0) = 21 clone(Process 395 attached (waiting for parent) Process 395 resumed (parent 394 ready) child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaabba90) = 395 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0 [pid 394] rt_sigaction(SIG_0, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) [pid 394] rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0 [pid 394] rt_sigaction(SIGHUP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0 [pid 394] rt_sigaction(SIGINT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [QUIT], NULL, 8) = 0 [pid 394] rt_sigaction(SIGQUIT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [ILL], NULL, 8) = 0 [pid 394] rt_sigaction(SIGILL, {SIG_DFL}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [TRAP], NULL, 8) = 0 [pid 394] rt_sigaction(SIGTRAP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 [pid 394] rt_sigaction(SIGABRT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [BUS], NULL, 8) = 0 [pid 394] rt_sigaction(SIGBUS, {SIG_DFL}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [FPE], NULL, 8) = 0 [pid 394] rt_sigaction(SIGFPE, {SIG_DFL}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [KILL], NULL, 8) = 0 [pid 394] rt_sigaction(SIGKILL, {0x411210, [], SA_RESTORER, 
0x3c59c0dd40}, {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) [pid 394] rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0 [pid 394] rt_sigaction(SIGUSR1, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0 [pid 394] rt_sigaction(SIGSEGV, {SIG_DFL}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [USR2], NULL, 8) = 0 [pid 394] rt_sigaction(SIGUSR2, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [PIPE], NULL, 8) = 0 [pid 394] rt_sigaction(SIGPIPE, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0 [pid 394] rt_sigaction(SIGALRM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0 [pid 394] rt_sigaction(SIGTERM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [STKFLT], NULL, 8) = 0 [pid 394] rt_sigaction(SIGSTKFLT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 395] socket(PF_FILE, SOCK_STREAM, 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [CHLD], [pid 395] <... socket resumed> ) = 3 [pid 394] <... rt_sigprocmask resumed> NULL, 8) = 0 [pid 395] fcntl(3, F_SETFD, FD_CLOEXEC) = 0 [pid 394] rt_sigaction(SIGCHLD, {SIG_DFL}, [pid 395] connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110 [pid 394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0 [pid 395] <... connect resumed> ) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [CONT], [pid 395] open("/dev/zero", O_RDONLY [pid 394] <... rt_sigprocmask resumed> NULL, 8) = 0 [pid 395] <... open resumed> ) = 4 [pid 394] rt_sigaction(SIGCONT, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, [pid 395] writev(3, [{"NAMC\3\0\0\20\24\0\0\0\5\0\0\0\0\0\0\0", 20}], 1 [pid 394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0 [pid 395] <... writev resumed> ) = 20 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [STOP], [pid 395] recvfrom(3, [pid 394] <... rt_sigprocmask resumed> NULL, 8) = 0 [pid 395] <... recvfrom resumed> "NAMC\0\0\0\0\30\0\0\0\5\0\0@\0\0\0\0", 20, 0, NULL, NULL) = 20 [pid 395] read(3, [pid 394] rt_sigaction(SIGSTOP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, [pid 395] <... read resumed> "\1\0\0\0", 4) = 4 [pid 394] <... rt_sigaction resumed> {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) [pid 395] pipe( [pid 394] rt_sigprocmask(SIG_UNBLOCK, [TSTP], [pid 395] <... pipe resumed> [5, 6]) = 0 [pid 395] fcntl(5, F_GETFL [pid 394] <... rt_sigprocmask resumed> NULL, 8) = 0 [pid 395] <... fcntl resumed> ) = 0 (flags O_RDONLY) [pid 395] fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK [pid 394] rt_sigaction(SIGTSTP, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, [pid 395] <... fcntl resumed> ) = 0 [pid 394] <... rt_sigaction resumed> {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [TTIN], [pid 395] open("/dev/misc/dlm_rgmanager", O_RDWR [pid 394] <... 
rt_sigprocmask resumed> NULL, 8) = 0 [pid 394] rt_sigaction(SIGTTIN, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [TTOU], NULL, 8) = 0 [pid 394] rt_sigaction(SIGTTOU, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [URG], NULL, 8) = 0 [pid 394] rt_sigaction(SIGURG, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [XCPU], NULL, 8) = 0 [pid 394] rt_sigaction(SIGXCPU, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [XFSZ], NULL, 8) = 0 [pid 394] rt_sigaction(SIGXFSZ, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [VTALRM], NULL, 8) = 0 [pid 394] rt_sigaction(SIGVTALRM, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0 [pid 394] rt_sigaction(SIGPROF, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [WINCH], NULL, 8) = 0 [pid 394] rt_sigaction(SIGWINCH, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [IO], NULL, 8) = 0 [pid 394] rt_sigaction(SIGIO, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [PWR], NULL, 8) = 0 [pid 394] rt_sigaction(SIGPWR, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [SYS], NULL, 8) = 0 [pid 394] rt_sigaction(SIGSYS, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RTMIN], NULL, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_1], NULL, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_2], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_2, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_3], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_3, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_4], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_4, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_5], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_5, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_6], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_6, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_7], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_7, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_8], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_8, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_9], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_9, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_10], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_10, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_11], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_11, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_12], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_12, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_13], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_13, {0x411210, [], 
SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_14], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_14, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_15], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_15, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_16], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_16, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_17], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_17, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_18], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_18, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_19], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_19, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_20], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_20, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_21], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_21, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_22], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_22, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_23], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_23, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_24], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_24, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_25], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_25, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_26], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_26, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_27], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_27, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_28], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_28, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_29], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_29, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_30], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_30, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [RT_31], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_31, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0 [pid 394] rt_sigaction(SIGRT_32, {0x411210, [], SA_RESTORER, 0x3c59c0dd40}, {SIG_DFL}, 8) = 0 [pid 394] wait4(395, Process 394 suspended [pid 395] <... 
open resumed> ) = -1 ENODEV (No such device) [pid 395] stat("/dev/misc/dlm-control", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 62), ...}) = 0 [pid 395] open("/proc/misc", O_RDONLY) = 7 [pid 395] fstat(7, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 [pid 395] read(7, "209 cpqci\n 60 dlm_clvmd\n 61 lock_dlm_plock\n 62 dlm-control\n 63 device-mapper\n144 nvram\n228 hpet\n135 rtc\n231 snapshot\n227 mcelog\n", 4096) = 128 [pid 395] close(7) = 0 [pid 395] munmap(0x2aaaaaaac000, 4096) = 0 [pid 395] open("/dev/misc/dlm-control", O_RDWR) = 7 [pid 395] fcntl(7, F_SETFD, FD_CLOEXEC) = 0 [pid 395] write(7, "\5\0\0\0\0\0\0\0\0\0\0\0\4\1\0\0\0\0\0\0\0\0\0\0rgmanager\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\300\233!Y<\0\0\0\200 O\f\0\0\0\0\330\333A\0\0\0\0\0\320*\0\203\377\177\0\0\0\0\0\0\0\0\0\0\263\270DY<\0\0\0 ", 113) = -1 EEXIST (File exists) [pid 395] open("/proc/misc", O_RDONLY) = 8 [pid 395] fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 [pid 395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 [pid 395] read(8, "209 cpqci\n 60 dlm_clvmd\n 61 lock_dlm_plock\n 62 dlm-control\n 63 device-mapper\n144 nvram\n228 hpet\n135 rtc\n231 snapshot\n227 mcelog\n", 4096) = 128 [pid 395] read(8, "", 4096) = 0 [pid 395] close(8) = 0 [pid 395] munmap(0x2aaaaaaac000, 4096) = 0 [pid 395] stat("/dev/misc/dlm_rgmanager", {st_mode=S_IFCHR|0644, st_rdev=makedev(10, 0), ...}) = 0 [pid 395] stat("/dev/misc/dlm_rgmanager", {st_mode=S_IFCHR|0644, st_rdev=makedev(10, 0), ...}) = 0 [pid 395] open("/dev/misc/dlm_rgmanager", O_RDWR) = -1 ENODEV (No such device) [pid 395] write(2, "failed acquiring lockspace: No such device\n", 43) = 43 [pid 395] fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0 [pid 395] ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff83002690) = -1 ENOTTY (Inappropriate ioctl for device) [pid 395] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000 [pid 395] write(1, "Locks not working!\n", 19) = 19 [pid 395] exit_group(-1) = ? Process 394 resumed Process 395 detached <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], 0, NULL) = 395 --- SIGCHLD (Child exited) @ 0 (0) --- exit_group(255) = ? Process 394 detached From janne.peltonen at helsinki.fi Sun Jul 1 11:30:40 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Sun, 1 Jul 2007 14:30:40 +0300 Subject: [Linux-cluster] Rgmanager fails to restart In-Reply-To: <20070701111748.GA9103@helsinki.fi> References: <20070701111748.GA9103@helsinki.fi> Message-ID: <20070701113040.GB9103@helsinki.fi> On Sun, Jul 01, 2007 at 02:17:48PM +0300, Janne Peltonen wrote: > Hi! > > Sometimes, when I have cleanly shut down rgmanager on one node, and the > services have nicely migrated to other nodes, trying to start rgmanager > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such > device". clurgmgrd concludes that locks are not working and exits. > (See strace output attached.) Interesting. After the one node with failing rgmanagers was shot in the head (there were no log lines about fencing, only two about deferring fencing to an earlier node), the fenced node was left in 'off' state, and, well, the other nodes had their services left running (but rgmanagers apparently stuck - no more status checks an no response to the clustat command). 
The node that (apparently, since there is no log entry) did the fencing: [jmmpelto at pcn2 ~]$ sudo cman_tool services type level name id state fence 0 default 00010001 FAIL_ALL_STOPPED [1 2 3 4 100] dlm 1 clvmd 00010002 FAIL_ALL_STOPPED [1 2 3 4 100] dlm 1 rgmanager 00020002 FAIL_ALL_STOPPED [1 2 3 4] Other nodes with rgmanager running: [jmmpelto at pcn3 ~]$ sudo cman_tool services type level name id state fence 0 default 00010001 FAIL_START_WAIT [2 3 4 100] dlm 1 clvmd 00010002 FAIL_ALL_STOPPED [1 2 3 4 100] dlm 1 rgmanager 00020002 FAIL_ALL_STOPPED [1 2 3 4] The fifth node without rgmanager: [jmmpelto at pcnm ~]$ sudo cman_tool services type level name id state fence 0 default 00010001 FAIL_START_WAIT [2 3 4 100] dlm 1 clvmd 00010002 FAIL_ALL_STOPPED [1 2 3 4 100] Er. What might be up. --Janne From janne.peltonen at helsinki.fi Sun Jul 1 11:45:21 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Sun, 1 Jul 2007 14:45:21 +0300 Subject: [Linux-cluster] Rgmanager fails to restart In-Reply-To: <20070701113040.GB9103@helsinki.fi> References: <20070701111748.GA9103@helsinki.fi> <20070701113040.GB9103@helsinki.fi> Message-ID: <20070701114521.GC9103@helsinki.fi> The story continues... On Sun, Jul 01, 2007 at 02:30:40PM +0300, Janne Peltonen wrote: > > Sometimes, when I have cleanly shut down rgmanager on one node, and the > > services have nicely migrated to other nodes, trying to start rgmanager > > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such > > device". clurgmgrd concludes that locks are not working and exits. > > (See strace output attached.) > Interesting. After the one node with failing rgmanagers was shot in the > head (there were no log lines about fencing, only two about deferring > fencing to an earlier node), the fenced node was left in 'off' state, and, > well, the other nodes had their services left running (but rgmanagers > apparently stuck - no more status checks an no response to the clustat > command). Now, the cluster node whose fencing resulted in a stuck system came up and joined the cluster. [jmmpelto at pcn1 ~]$ sudo cman_tool services type level name id state fence 0 default 00000000 JOIN_STOP_WAIT [1 2 3 4 100] dlm 1 clvmd 00000000 JOIN_STOP_WAIT [1 2 3 4 100] [jmmpelto at pcn1 ~]$ sudo cman_tool status Version: 6.0.1 Config Version: 40 Cluster Name: mappi-primary Cluster Id: 11929 Cluster Member: Yes Cluster Generation: 184 Membership state: Cluster-Member Nodes: 5 Expected votes: 5 Total votes: 5 Quorum: 3 Active subsystems: 8 Flags: Ports Bound: 0 11 Node name: pcn1-hb Node ID: 1 Multicast addresses: 239.192.46.199 Node addresses: 10.3.0.11 I killed the completely stuck pcn2-hb from there: [jmmpelto at pcn1 ~]$ sudo cman_tool kill -n pcn2-hb Log: Jul 1 14:36:36 pcn2.mappi.helsinki.fi dlm_controld[4577]: cluster is down, exiting Jul 1 14:36:36 pcn2.mappi.helsinki.fi gfs_controld[4583]: cluster is down, exiting Jul 1 14:36:36 pcn2.mappi.helsinki.fi fenced[4571]: cluster is down, exiting Jul 1 14:36:59 pcn2.mappi.helsinki.fi ccsd[4508]: Unable to connect to cluster infrastructure after 30 seconds. 
Thereafter, node pcn3-hb fenced it, this time with log entries: Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn2-hb not a cluster member after 0 sec post_fail_delay Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn1-hb not a cluster member after 0 sec post_fail_delay Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: fencing node "pcn2-hb" Jul 1 14:38:08 pcn3.mappi.helsinki.fi fenced[4371]: fence "pcn2-hb" success Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Attempt to close an unopened CCS descriptor (3012450). Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Error while processing disconnect: Invalid request descriptor But nobody tried to fence pcn1-hb (see the second log line). But apparently, pcn3-hb tried to say something to pcn1-hb. Jul 1 14:38:13 pcn1.mappi.helsinki.fi fenced[4461]: fencing deferred to prior member Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/id" error -1 2 Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/control" error -1 2 Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2 This time the services are in no specific state, but the rgmanager still does nothin constructive: [jmmpelto at pcn3 ~]$ sudo cman_tool services Password: type level name id state fence 0 default 00010001 none [1 3 4 100] dlm 1 clvmd 00010002 none [1 3 4 100] dlm 1 rgmanager 00020002 none [1 3 4] [jmmpelto at pcn3 ~]$ sudo clustat Timed out waiting for a response from Resource Group Manager Member Status: Quorate Member Name ID Status ------ ---- ---- ------ pcnm-hb 100 Online pcn1-hb 1 Online pcn2-hb 2 Offline pcn3-hb 3 Online, Local pcn4-hb 4 Online On node pcn1-hb: [jmmpelto at pcn1 ~]$ sudo cman_tool services type level name id state fence 0 default 00010001 none [1 3 4 100] dlm 1 clvmd 00010002 none [1 3 4 100] dlm 1 rgmanager 00020002 none [1 3 4] [jmmpelto at pcn1 ~]$ [jmmpelto at pcn1 ~]$ [jmmpelto at pcn1 ~]$ sudo clustat Member Status: Quorate Member Name ID Status ------ ---- ---- ------ pcnm-hb 100 Online pcn1-hb 1 Online, Local pcn2-hb 2 Offline pcn3-hb 3 Online pcn4-hb 4 Online Er again. --Janne From janne.peltonen at helsinki.fi Sun Jul 1 11:52:01 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Sun, 1 Jul 2007 14:52:01 +0300 Subject: [Linux-cluster] Rgmanager fails to restart In-Reply-To: <20070701114521.GC9103@helsinki.fi> References: <20070701111748.GA9103@helsinki.fi> <20070701113040.GB9103@helsinki.fi> <20070701114521.GC9103@helsinki.fi> Message-ID: <20070701115201.GD9103@helsinki.fi> On Sun, Jul 01, 2007 at 02:45:21PM +0300, Janne Peltonen wrote: > Er again. At this point, I said 'cman_tool leave force' on pcn1-hb, which resulted in pcn3-hb fencing it. This time the fencing was successful, and the rgmanagers on remaining nodes woke up. All kinds of mayhem... It's always a ten-minute break in services if a service has to be relocated from a node to another (the services are a bit slow to start) (or longer, if a node is really down and the rgmanagers elsewhere are stuck). --Janne -- Janne Peltonen From janne.peltonen at helsinki.fi Sun Jul 1 12:12:39 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Sun, 1 Jul 2007 15:12:39 +0300 Subject: [Linux-cluster] At which point is a service relocated? Message-ID: <20070701121239.GE9103@helsinki.fi> Hi! 
It seems to me that if a node comes up that is higher in the prioritized failover domain of a service than the node that was running it, the service doesn't always relocate. Is there some documentation on this somewhere? It appears that the service relocates if the difference between the priorities of the old and new node is at least two. Is there a way to modify this? Thanks. --Janne From David.Schroeder at flinders.edu.au Sun Jul 1 23:37:15 2007 From: David.Schroeder at flinders.edu.au (David Schroeder) Date: Mon, 02 Jul 2007 09:07:15 +0930 Subject: [Linux-cluster] IP monitor failing periodically In-Reply-To: <4686A3BF.9070609@cmiware.com> References: <4686A3BF.9070609@cmiware.com> Message-ID: <46883AAB.8000801@flinders.edu.au> Hi Chris, I am experiencing the same problem on RHEL 5 and I have a support request in with RedHat. I was asked to increase the debug level by changing the line in the cluster configuration to: I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and run "service syslog restart". To update the cluster configuration I needed to propagate the cluster configuration to both nodes: # ccs_tool update /etc/cluster/cluster.conf After a week I have not had the problem with the increased logging despite the problem occurring regularly prior to that - 2 to 3 times a day. One day last week out of curiosity I reverted to the default settings and within a few hours I had the failure to ping error on one of the clustered IP addresses and the service was restarted. I now have the logging back at 7 and the support request is pending. Regards -- David Schroeder Server Support Information Services Division Flinders University Adelaide, Australia Ph: +61 8 8201 2689 Chris Harms wrote: > I am experiencing periodic failovers due to a floating IP address not > passing the status check: > > clurgmgrd: [9975]: Failed to ping 192.168.13.204 > Jun 30 11:41:47 nodeA clurgmgrd[9975]: status on ip > "192.168.13.204" returned 1 (generic error) > > Both nodes have bonded NICs with gigabit connections to redundant > switches, so it is unlikely they are going down, nothing in the logs > about linux losing the links. I parked all the cluster services - 2 > Postgres services and 1 Apache - on one node and allowed it to run > overnight. There would be no client activity during this time. One > Postgres service failed two times in this manner and the other failed > once in this manner. The Apache service did not fail. > > What can I do to resolve this or get more information out of the system > to resolve this? 
> > Thanks in advance, > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pastany at gmail.com Mon Jul 2 07:23:35 2007 From: pastany at gmail.com (pastany) Date: Mon, 2 Jul 2007 15:23:35 +0800 Subject: [Linux-cluster] SCSI Error Message-ID: <200706272056059379538@gmail.com> I am running a 4 node cluster with a fc switch and a fujitsu fc san but i recevie this message ,and some partions dont work Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 668992848 Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: fatal: I/O error Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: block = 83624058 Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: function = gfs_dreread Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 576 Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: time = 1182943429 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636680 Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: about to withdraw from the cluster Jun 27 19:23:49 test1 kernel: GFS: fsid=tvod:vod.0: waiting for outstanding I/O Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636688 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636696 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636704 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636712 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636720 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636728 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector 418636736 Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636744 Jun 27 19:23:50 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636752 Jun 27 19:23:50 test1 kernel: SCSI error : <3 0 0 0> return code = 0x10000 Jun 27 19:23:50 test1 kernel: end_request: I/O error, dev sdb, sector 418636760 Jun 27 19:23:50 test1 kernel: GFS: fsid=tvod:vod.0: telling LM to withdraw Jun 27 19:33:37 test1 kernel: lock_dlm: withdraw abandoned memory Jun 27 19:33:37 test1 kernel: GFS: fsid=tvod:vod.0: withdraw pastany 2007-06-27 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rpeterso at redhat.com Mon Jul 2 13:58:20 2007 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 02 Jul 2007 08:58:20 -0500 Subject: [Linux-cluster] SCSI Error In-Reply-To: <200706272056059379538@gmail.com> References: <200706272056059379538@gmail.com> Message-ID: <1183384700.11507.75.camel@technetium.msp.redhat.com> On Mon, 2007-07-02 at 15:23 +0800, pastany wrote: > I am running a 4 node cluster with a fc switch and a fujitsu fc san > but i recevie this message ,and some partions dont work > > Jun 27 19:23:49 test1 kernel: SCSI error : <3 0 0 0> return code = > 0x10000 > Jun 27 19:23:49 test1 kernel: end_request: I/O error, dev sdb, sector > 668992848 Hi Pastany, This sounds like a hardware problem to me, not a GFS problem. It could be a bad drive, bad san or bad Host Bus Adapter (HBA). Perhaps you should unmount the san from all the nodes, then from one node, do a simple read test of the san: 1. Check dmesg and maybe clear your dmesg buffer: dmesg -c 2. Try reading every sector of the san: dd if=/dev/sdb of=/dev/null bs=1M 3. Check your console / dmesg to see if SCSI errors are reported. You may want to try that separately on a few different nodes just in case the error was caused by a bad HBA in the node that reported the problem. Regards, Bob Peterson Red Hat Cluster Suite From chris at cmiware.com Mon Jul 2 15:26:59 2007 From: chris at cmiware.com (Chris Harms) Date: Mon, 02 Jul 2007 10:26:59 -0500 Subject: [Linux-cluster] issues starting services Message-ID: <46891943.6010202@cmiware.com> Had a nice little hardware failure over the weekend. After having the machine come back on-line, 2 of the registered services didn't start (the fact that they weren't running already is a function of services not failing over until fencing succeeds). Issuing a start operation in Conga did nothing. Issuing clusvcadm -e [service] -m [node] yielded: Member [node] trying to enable service:[service]...Success service:[service] is now running on [node] This was not the case. Nothing happened. Nothing was logged. clusvcadm -R [service] seemed to be the magic bullet. Is that the official way to recover a service? Chris From jwilson at transolutions.net Mon Jul 2 20:10:49 2007 From: jwilson at transolutions.net (James Wilson) Date: Mon, 02 Jul 2007 15:10:49 -0500 Subject: [Linux-cluster] failover not working Message-ID: <46895BC9.5010101@transolutions.net> Hey All, I was just wondering if someone could point out my errors? I currently have 3 servers in a cluster server1(dolphins), server2(lions), server3(patriots). server1 and server3 are being mirrored via DRBD. I have set up the cluster so that if server1 fails then server3 will take over. I have configured a vip to go between the 2 servers. I also do a gnbd_import on this vip from server2. The problem is when ever I pull the plug on server1 the vip never moves over to server3. here is a copy of my cluster.conf. Any help is appreciated. From tomas.hoger at gmail.com Tue Jul 3 11:19:44 2007 From: tomas.hoger at gmail.com (Tomas Hoger) Date: Tue, 3 Jul 2007 13:19:44 +0200 Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster Message-ID: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com> Hi! I've come across a problem with two-node cluster on RHEL 4U3. When I attempt to reboot one of the nodes, it sometimes fails to leave cluster correctly. Before reboot, both nodes are cluster members and it is possible to fail-over services from one node to another. 
When I try to reboot node1 (active at that time), services fail-over to node2, however, cman fails to stop correctly: cman: Stopping cman: cman: failed to stop cman failed node2 logs following message: kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats I see no information about fencing attempts in the log. After node1's reboot, it is not able to rejoin cluster any more. node1: kernel: CMAN: Waiting to join or form a Linux-cluster kernel: CMAN: sending membership request kernel: CMAN: got node node2 cman: Timed-out waiting for cluster failed While on node2: kernel: CMAN: node node1 rejoining and after ~4.5 minutes: kernel: CMAN: too many transition restarts - will die kernel: CMAN: we are leaving the cluster. Inconsistent cluster view kernel: WARNING: dlm_emergency_shutdown clurgmgrd[2848]: #67: Shutting down uncleanly kernel: WARNING: dlm_emergency_shutdown kernel: SM: 00000001 sm_stop: SG still joined kernel: SM: 01000003 sm_stop: SG still joined kernel: SM: 03000002 sm_stop: SG still joined ccsd[2242]: Cluster is not quorate. Refusing connection. ccsd[2242]: Error while processing connect: Connection refused ccsd[2242]: Invalid descriptor specified (-111). ccsd[2242]: Someone may be attempting something evil. ccsd[2242]: Error while processing get: Invalid request descriptor ccsd[2242]: Invalid descriptor specified (-111). ccsd[2242]: Someone may be attempting something evil. ccsd[2242]: Error while processing get: Invalid request descriptor ccsd[2242]: Invalid descriptor specified (-21). and again ~1 minute later on node1: kernel: CMAN: removing node node2 from the cluster : No response to messages kernel: ------------[ cut here ]------------ kernel: kernel BUG at /usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150! kernel: invalid operand: 0000 [#1] kernel: SMP kernel: Modules linked in: cman(U) md5 ipv6 iptable_filter ip_tables button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy sg st mptspi mptscsi mptbase dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod kernel: CPU: 0 kernel: EIP: 0060:[] Not tainted VLI kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp) kernel: EIP is at elect_master+0x2e/0x3a [cman] kernel: eax: 00000000 ebx: f7b4afa0 ecx: 00000080 edx: 00000080 kernel: esi: f8bff044 edi: f7b4afd8 ebp: 00000000 esp: f7b4af98 kernel: ds: 007b es: 007b ss: 0068 kernel: Process cman_memb (pid: 2429, threadinfo=f7b4a000 task=c1a33230) kernel: Stack: f8bfef08 f8be98d1 c1a7c580 f6e8ee00 f8be7eb7 c1a33230 c1a33230 f8be809a kernel: 0000001f 00000000 f7b460b0 00000000 c1a33230 c011e71b 00100100 00200200 kernel: 00000000 00000000 0000007b f8be7ed8 00000000 00000000 c01041f5 00000000 kernel: Call Trace: kernel: [] a_node_just_died+0x13a/0x199 [cman] kernel: [] process_dead_nodes+0x4e/0x6f [cman] kernel: [] membership_kthread+0x1c2/0x39d [cman] kernel: [] default_wake_function+0x0/0xc kernel: [] membership_kthread+0x0/0x39d [cman] kernel: [] kernel_thread_helper+0x5/0xb kernel: Code: 28 fe bf f8 89 c3 ba 01 00 00 00 39 ca 7d 1c a1 2c fe bf f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89 03 8b 40 14 eb 0d 42 eb e0 <0f> 0b 4e 0c 73 2d bf f8 31 c0 5b c3 a1 2c fe bf f8 e8 79 70 56 kernel: <0>Fatal exception: panic in 5 seconds During one other test, cluster did not crash, it just ended in the state, when cman on rebooted node kept sending cluster membership requests and those requests were ignored by other cluster node. 
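(For reference, a capture like the one described below can be taken with
something along the lines of

  tcpdump -n -i eth0 udp port 6809

where the interface name and the default cman port 6809 are assumptions
about this particular setup.)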
Output of tcpdump showed traffic was reaching active node, but there was no reply nor any message in the logs of active node. Only way to get to normal state is to restart cman on active node (or reboot both nodes). If I try to reboot one of cluster nodes shortly after rebooting both nodes, it seems to leave and rejoin cluster successfully. Has anyone observed similar behavior? Is this known bug in U3, which can be resolved by upgrade to latest version? I've checked changelogs and release notes (Btw, any chance to get back to "old" release notes format for RHCS? Release notes for U5 do not longer list fixed bugzilla reports, only links some errata listings, which do not seem to be accessible from Internet.), but haven't found any obvious reference to this king of problem. Ideas appreciated. th. From jbrassow at redhat.com Tue Jul 3 20:59:59 2007 From: jbrassow at redhat.com (Jonathan Brassow) Date: Tue, 3 Jul 2007 15:59:59 -0500 Subject: [Linux-cluster] failover not working In-Reply-To: <46895BC9.5010101@transolutions.net> References: <46895BC9.5010101@transolutions.net> Message-ID: <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com> On Jul 2, 2007, at 3:10 PM, James Wilson wrote: > > > > ordered="1"> > priority="1"/> > priority="2"/> > > ordered="1" restricted="0"> > priority="2"/> > priority="1"/> > > > > > > name="dolphins-svc-drbd1" recovery="relocate"> > > > name="dolphins-svc-drbd2" recovery="relocate"> > > > That section looks funny. You don't need the two 'failoverdomain's; and you don't need the two services. Does rgmanager even startup? Check /var/log/messages for more info. brassow From jwilson at transolutions.net Tue Jul 3 21:08:11 2007 From: jwilson at transolutions.net (James Wilson) Date: Tue, 03 Jul 2007 16:08:11 -0500 Subject: [Linux-cluster] failover not working In-Reply-To: <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com> References: <46895BC9.5010101@transolutions.net> <531F9128-A7EB-4359-AC1D-ADCB5E41DB3F@redhat.com> Message-ID: <468ABABB.6080104@transolutions.net> I have changed it since I posted this. But I though I needed the 2 failover domains? One for each host so if dolphins fails it failsover to patriots and vice versa. Or do I just need one because of the virtual IP? Jonathan Brassow wrote: > > On Jul 2, 2007, at 3:10 PM, James Wilson wrote: > >> >> >> >> > ordered="1"> >> > priority="1"/> >> > priority="2"/> >> >> > ordered="1" restricted="0"> >> > priority="2"/> >> > priority="1"/> >> >> >> >> >> >> > name="dolphins-svc-drbd1" recovery="relocate"> >> >> >> > name="dolphins-svc-drbd2" recovery="relocate"> >> >> >> > > That section looks funny. You don't need the two 'failoverdomain's; > and you don't need the two services. Does rgmanager even startup? > Check /var/log/messages for more info. > > brassow > From chris at cmiware.com Tue Jul 3 23:20:23 2007 From: chris at cmiware.com (Chris Harms) Date: Tue, 03 Jul 2007 18:20:23 -0500 Subject: [Linux-cluster] dual fence redux Message-ID: <468AD9B7.4060100@cmiware.com> To recap: I am attempting to setup a 2 node cluster where each will run a DB and an apache service to be failed over between them. Both are fenced via Dell DRAC connected via the system NICs (this adds to the issue, but manual fencing is broken). My test case so far is to unplug the network cables from one node and then reconnect them. For some reason, both machines get halted instead of one machine being fenced. Having only one node fenced in this scenario has only occurred successfully one time. 
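(For anyone trying to reproduce this: membership and fence state during
the cable pull can be watched from both boxes with the standard tools,
roughly

  cman_tool status
  cman_tool nodes
  cman_tool services

to see whether each side really declares the other dead before anything
gets power cycled.)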
I previously suspected DRBD as being the culprit, but I can now rule this out after performing the cable pull test without RHCS running, and having DRBD in every possible configuration the cluster could put it in, including a split brain (which is impossible for me due to services not failing over until fencing occurs). Is there any component of the cluster system that would issue the shutdown command shown in the log entry below? [From logs on Node A] Jul 3 17:36:20 nodeA openais[3504]: [MAIN ] Killing node nodeB because it has rejoined the cluster without cman_tool join Jul 3 17:36:20 nodeA kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) Jul 3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now. Jul 3 17:36:21 nodeA kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) Jul 3 17:36:21 nodeA kernel: drbd0: Began resync as SyncSource (will sync 56 KB [14 bits set]). Jul 3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now. Jul 3 17:36:21 nodeA kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 56 K/sec) Jul 3 17:36:21 nodeA kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) Jul 3 17:36:21 nodeA kernel: drbd0: Writing meta data super block now. Jul 3 17:36:21 nodeA shutdown[18845]: shutting down for system halt Thanks to a hardware issue on NodeB, I am unable to get to the logs off of it presently. From bsd_daemon at msn.com Wed Jul 4 08:52:37 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Wed, 04 Jul 2007 08:52:37 +0000 Subject: [Linux-cluster] failover not working In-Reply-To: <468ABABB.6080104@transolutions.net> Message-ID: You should use Linux H.A. (heartbeat) for failover. the rgmanager very good but not yet. you not must use failover with rgmanager. you can get some problems because of priority and two failover happened some the wrong. have a nice day.. >From: James Wilson >Reply-To: jwilson at transolutions.net,linux clustering > >To: Jonathan Brassow >CC: linux clustering >Subject: Re: [Linux-cluster] failover not working >Date: Tue, 03 Jul 2007 16:08:11 -0500 > >I have changed it since I posted this. But I though I needed the 2 failover >domains? One for each host so if dolphins fails it failsover to patriots >and vice versa. Or do I just need one because of the virtual IP? > > > > > priority="1"/> > priority="2"/> > > restricted="0"> > priority="2"/> > priority="1"/> > > > > > > name="dolphins-svc-drbd1" recovery="relocate"> > > > name="dolphins-svc-drbd2" recovery="relocate"> > > > > > > > >Jonathan Brassow wrote: >> >>On Jul 2, 2007, at 3:10 PM, James Wilson wrote: >> >>> >>> >>> >>> >>ordered="1"> >>> >>priority="1"/> >>> >>priority="2"/> >>> >>> >>restricted="0"> >>> >>priority="2"/> >>> >>priority="1"/> >>> >>> >>> >>> >>> >>> >>name="dolphins-svc-drbd1" recovery="relocate"> >>> >>> >>> >>name="dolphins-svc-drbd2" recovery="relocate"> >>> >>> >>> >> >>That section looks funny. You don't need the two 'failoverdomain's; and >>you don't need the two services. Does rgmanager even startup? Check >>/var/log/messages for more info. 
>> >> brassow >> > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster _________________________________________________________________ http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_pcmag_0507 From Donald.Deasy at sli-institute.ac.uk Wed Jul 4 10:01:49 2007 From: Donald.Deasy at sli-institute.ac.uk (Donald Deasy) Date: Wed, 4 Jul 2007 11:01:49 +0100 Subject: [Linux-cluster] Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ? Message-ID: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk> Just tried compiling cluster on 2.6.9-55.0.2 from the RHEL45 CVS repository. Got a few errors which I fixed by creating links to the build/incdir/cluster directory, which may be the problem ! Cannot figure out why it's failing at gfs/gfs_edit make[4]: Entering directory `/root/sources.redhat.com/RHEL45/cluster/gfs/gfs_edit' gcc -Wall -I../include -I../config -I/root/sources.redhat.com/RHEL45/cluster/build/incdir -DHELPER_PROGRAM -D_FILE_OFFSET_BITS=64 -DGFS_RELEASE_NAME=\"DEVEL.1183478317\" -I../include -I../config -I/root/sources.redhat.com/RHEL45/cluster/build/incdir gfshex.c hexedit.c -lncurses -o gfs_edit In file included from gfshex.c:29: /root/sources.redhat.com/RHEL45/cluster/build/incdir/linux/gfs_ondisk.h: 626: error: syntax error before "__be64" Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ? From bsd_daemon at msn.com Wed Jul 4 09:45:44 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Wed, 04 Jul 2007 09:45:44 +0000 Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster In-Reply-To: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com> Message-ID: hi tomas, when you do restart. which services run on the node1 ??? >node2 logs following message: > >kernel: CMAN: removing node node1 from the cluster : Missed too many >heartbeats when network problem, you get this error. >kernel: CMAN: too many transition restarts - will die >kernel: CMAN: we are leaving the cluster. Inconsistent cluster view >kernel: WARNING: dlm_emergency_shutdown >clurgmgrd[2848]: #67: Shutting down uncleanly >kernel: WARNING: dlm_emergency_shutdown >kernel: SM: 00000001 sm_stop: SG still joined >kernel: SM: 01000003 sm_stop: SG still joined >kernel: SM: 03000002 sm_stop: SG still joined >ccsd[2242]: Cluster is not quorate. Refusing connection. >ccsd[2242]: Error while processing connect: Connection refused >ccsd[2242]: Invalid descriptor specified (-111). >ccsd[2242]: Someone may be attempting something evil. >ccsd[2242]: Error while processing get: Invalid request descriptor >ccsd[2242]: Invalid descriptor specified (-111). >ccsd[2242]: Someone may be attempting something evil. >ccsd[2242]: Error while processing get: Invalid request descriptor >ccsd[2242]: Invalid descriptor specified (-21). > >and again ~1 minute later on node1: > >kernel: CMAN: removing node node2 from the cluster : No response to >messages >kernel: ------------[ cut here ]------------ >kernel: kernel BUG at >/usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150! >kernel: invalid operand: 0000 [#1] >kernel: SMP i thing this error a bug. did you check this error from bugzilla ? _________________________________________________________________ Local listings, incredible imagery, and driving directions - all in one place! 
http://maps.live.com/?wip=69&FORM=MGAC01 From manjusc13 at rediffmail.com Wed Jul 4 11:35:31 2007 From: manjusc13 at rediffmail.com (manjunath c shanubog) Date: 4 Jul 2007 11:35:31 -0000 Subject: [Linux-cluster] Mysql installation on Cluster Message-ID: <20070704113531.518.qmail@webmail81.rediffmail.com> Hi,      I need complete installation guide for installing cluster using redhat EL 5, and Mysql installtion guide on the cluster.      which fencing device is better whether APC 9120 or NPS 230 from Western telematic.Thanking YouManjunath      -------------- next part -------------- An HTML attachment was scrubbed... URL: From wcheng at redhat.com Wed Jul 4 15:19:11 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Wed, 04 Jul 2007 11:19:11 -0400 Subject: [Linux-cluster] Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ? In-Reply-To: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk> References: <7D7A68F7F09DAE40AE47E55F7F601D8BD37225@SLISERVER21.sli-institute.ac.uk> Message-ID: <468BBA6F.1000702@redhat.com> Donald Deasy wrote: >Just tried compiling cluster on 2.6.9-55.0.2 from the RHEL45 CVS >repository. > >Got a few errors which I fixed by creating links to the >build/incdir/cluster directory, which may be the problem ! > >Cannot figure out why it's failing at gfs/gfs_edit >make[4]: Entering directory >`/root/sources.redhat.com/RHEL45/cluster/gfs/gfs_edit' >gcc -Wall -I../include -I../config >-I/root/sources.redhat.com/RHEL45/cluster/build/incdir -DHELPER_PROGRAM >-D_FILE_OFFSET_BITS=64 -DGFS_RELEASE_NAME=\"DEVEL.1183478317\" >-I../include -I../config >-I/root/sources.redhat.com/RHEL45/cluster/build/incdir gfshex.c >hexedit.c -lncurses -o gfs_edit >In file included from gfshex.c:29: >/root/sources.redhat.com/RHEL45/cluster/build/incdir/linux/gfs_ondisk.h: >626: error: syntax error before "__be64" > >Is cluster 1 RHEL45 correct branch for 2.6.9-55.0.2 ? > > > Sorry, it was my oversight. The problem is fixed in RHEL4 (queued for 4.6) branch but not RHEL45. Will have to discuss with our PM to see what needs to be done to get the changes added into RHEL45 branch. In the mean time, please take the patch from: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=239523#c2 Let me know if you have more issues. -- Wendy From tomas.hoger at gmail.com Wed Jul 4 16:36:46 2007 From: tomas.hoger at gmail.com (Tomas Hoger) Date: Wed, 4 Jul 2007 18:36:46 +0200 Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster In-Reply-To: References: <6cfbd1b40707030419i6fa984b0m37c827f834400cfb@mail.gmail.com> Message-ID: <6cfbd1b40707040936y6d60e3dbt5c6c03b171d0e523@mail.gmail.com> On 7/4/07, mehmet celik wrote: > when you do restart. which services run on the node1 ??? Cluster only use one service (consisting of IP, filesystems and applications) and it was running on node1 before it was rebooted. Logs also show that service was moved to node2 during node1's shutdown. > when network problem, you get this error. We haven't noticed any network-related problems. th. From bsd_daemon at msn.com Thu Jul 5 07:20:07 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Thu, 05 Jul 2007 07:20:07 +0000 Subject: [Linux-cluster] Mysql installation on Cluster In-Reply-To: <20070704113531.518.qmail@webmail81.rediffmail.com> Message-ID: hii manjunath, how will you work the mysql cluster ? I know two way for the mysql-cluster. 1. active-passive (failover) 2. active-active you don't active-active, because it's not be with RHCS. 
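(For the active-passive case the usual rgmanager pattern is a floating IP
plus the mysqld init script grouped into one service; failover can then be
exercised by hand with something like

  clusvcadm -e mysql-svc -m node1   # enable on the preferred node
  clusvcadm -r mysql-svc -m node2   # relocate to the standby

where mysql-svc, node1 and node2 are only placeholder names.)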
for this, you have to visit mysql.com >From: "manjunath c shanubog" >Reply-To: linux clustering >To: >Subject: [Linux-cluster] Mysql installation on Cluster >Date: 4 Jul 2007 11:35:31 -0000 > >Hi,      I need complete installation guide for >installing cluster using redhat EL 5, and Mysql installtion guide on >the cluster.      which fencing device >is better whether APC 9120 or NPS 230 from Western telematic.Thanking >YouManjunath      >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster _________________________________________________________________ http://liveearth.msn.com From bsd_daemon at msn.com Thu Jul 5 07:55:45 2007 From: bsd_daemon at msn.com (mehmet celik) Date: Thu, 05 Jul 2007 07:55:45 +0000 Subject: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster In-Reply-To: <6cfbd1b40707040936y6d60e3dbt5c6c03b171d0e523@mail.gmail.com> Message-ID: hii Tomas, "missed too ... heartbeat ..." this error is generally network and comminucation problems. You should tcpdump for the event. you using tcpdump, find source of this error. example cman to cman communication, 16:40:13.540514 IP node1.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 16:40:14.037110 IP node2.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 16:40:15.059749 IP node3.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 16:41:28.568924 IP node1.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 16:41:29.016120 IP node2.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 16:41:30.046889 IP node3.domain.com.6809 > 10.0.0.255.6809: UDP, length 28 >From: "Tomas Hoger" >Reply-To: linux clustering >To: "linux clustering" >Subject: Re: [Linux-cluster] Cluster rejoin problem - 4U3, two node cluster >Date: Wed, 4 Jul 2007 18:36:46 +0200 > >On 7/4/07, mehmet celik wrote: >>when you do restart. which services run on the node1 ??? > >Cluster only use one service (consisting of IP, filesystems and >applications) and it was running on node1 before it was rebooted. >Logs also show that service was moved to node2 during node1's >shutdown. > >>when network problem, you get this error. > >We haven't noticed any network-related problems. > >th. > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster _________________________________________________________________ http://liveearth.msn.com From teigland at redhat.com Thu Jul 5 18:14:39 2007 From: teigland at redhat.com (David Teigland) Date: Thu, 5 Jul 2007 13:14:39 -0500 Subject: [Linux-cluster] dual fence redux In-Reply-To: <468AD9B7.4060100@cmiware.com> References: <468AD9B7.4060100@cmiware.com> Message-ID: <20070705181439.GA23666@redhat.com> On Tue, Jul 03, 2007 at 06:20:23PM -0500, Chris Harms wrote: > To recap: > I am attempting to setup a 2 node cluster where each will run a DB and > an apache service to be failed over between them. Both are fenced via > Dell DRAC connected via the system NICs (this adds to the issue, but > manual fencing is broken). > > My test case so far is to unplug the network cables from one node and > then reconnect them. For some reason, both machines get halted instead > of one machine being fenced. Having only one node fenced in this > scenario has only occurred successfully one time. 
> > I previously suspected DRBD as being the culprit, but I can now rule > this out after performing the cable pull test without RHCS running, and > having DRBD in every possible configuration the cluster could put it in, > including a split brain (which is impossible for me due to services not > failing over until fencing occurs). > > Is there any component of the cluster system that would issue the > shutdown command shown in the log entry below? Perhaps qdisk? Dave From bmarzins at redhat.com Thu Jul 5 20:01:28 2007 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Thu, 5 Jul 2007 15:01:28 -0500 Subject: [linux-cluster] multipath issue... Smells of hardware issue. In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk> References: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk> Message-ID: <20070705200128.GC27466@ether.msp.redhat.com> On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote: > Hi, > > I have a setup with two identical RX200s3 FuSi servers talking to a SAN > (SX60 + extra controller), and that works fine with gfs1. > > I do however see some errors on one of the servers. It's in my message log > and only now and then now and then (though always under load, but i cant > load it and thereby force it to give the error). > > The error says: > Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed > Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active > paths: 1 > Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code = > 0x00070000 > Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector > 705160231 > Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16. > Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is > up > Jun 28 15:44:22 app02 multipathd: 8:16: reinstated > Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed > Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active > paths: 1 > Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code = > 0x00070000 > Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector > 739870727 > Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32. > Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is > up > Jun 28 15:46:06 app02 multipathd: 8:32: reinstated > Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > > To me i looks like a fiber that bounces up and down. (There is no switch > involved). > > Sometimes i only get a slightly shorter version: > Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code = > 0x00070000 > Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector > 2782490295 > Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16. > Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed > Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active > paths: 1 > Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is > up > Jun 29 09:04:37 app02 multipathd: 8:16: reinstated > Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > > Any sugestions, but start swapping hardware? It's possible that your scsi device is timing out the scsi read command from the readsector0 path checker, which is what it appears that your setup is using to check the path status. 
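(For reference, and assuming a fairly stock setup, the checker in use is normally set in the defaults section of /etc/multipath.conf, e.g. something like:

defaults {
        polling_interval   5
        path_checker       readsector0
}

so you can confirm there whether readsector0 really is the checker your paths are using.)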
This checker has its timeout set to 5 minutes, but I suppose that it is possible to take this long if your hardware is flaky. If you're willing to recompile the code, you can change this default by changing DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the scsi command timeout in milliseconds.

Otherwise, if you are only seeing this on one server, swapping hardware seems like a reasonable thing to try.

-Ben

> Mvh / Kind regards
>
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
>
> Tlf. +45 8738 3032
> Cell. +45 6062 8703

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From kristoffer.lippert at jppol.dk Fri Jul 6 07:51:24 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 6 Jul 2007 09:51:24 +0200
Subject: SV: [linux-cluster] multipath issue... Smells of hardware issue.
In-Reply-To: <20070705200128.GC27466@ether.msp.redhat.com>
References: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk> <20070705200128.GC27466@ether.msp.redhat.com>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A55@exchsrv07.rootdom.dk>

Hi,

Thank you very much for the explanation.

The hardware should under no circumstances take 5 minutes to perform a readsector, not even when the command queue is very long. I've tried copying files to and from the SAN, and I've tried a little program called sys_basher working the disks continuously since last Friday (almost a week), and I have not been able to reproduce the error. Before, I could produce it within an hour by copying files. I've only seen the error on one server, and I've changed nothing. (Well, obviously something must have changed, since the error seems to be gone.)

I get a throughput of about 120 MB/sec on the SAN using GFS1. It's fast enough for my use (which is large files for a website). Is it far below the expected throughput?

Kind regards
Kristoffer

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] På vegne af Benjamin Marzinski
Sendt: 5. juli 2007 22:01
Til: linux clustering
Emne: Re: [linux-cluster] multipath issue... Smells of hardware issue.

On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote:
> Hi,
>
> I have a setup with two identical RX200s3 FuSi servers talking to a SAN
> (SX60 + extra controller), and that works fine with gfs1.
>
> I do however see some errors on one of the servers. It's in my message log
> and only now and then now and then (though always under load, but i cant
> load it and thereby force it to give the error).
>
> The error says:
> Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
> Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active
> paths: 1
> Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code =
> 0x00070000
> Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector
> 705160231
> Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16.
> Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is > up > Jun 28 15:44:22 app02 multipathd: 8:16: reinstated > Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed > Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active > paths: 1 > Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code = > 0x00070000 > Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector > 739870727 > Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32. > Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is > up > Jun 28 15:46:06 app02 multipathd: 8:32: reinstated > Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > > To me i looks like a fiber that bounces up and down. (There is no switch > involved). > > Sometimes i only get a slightly shorter version: > Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code = > 0x00070000 > Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector > 2782490295 > Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16. > Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed > Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active > paths: 1 > Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is > up > Jun 29 09:04:37 app02 multipathd: 8:16: reinstated > Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active > paths: 2 > > Any sugestions, but start swapping hardware? It's possible that your scsi device is timing out the scsi read command from the readsector0 path checker, which is what it appears that your setup is using to check the path status. This checker has it's timeout set to 5 minutes, but I suppose that it is possible to take this long if your hardware is a flaky. If you're willing to recompile the code, you can change this default by changing DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the scsi command timeout in milliseconds. Otherwise, if you are only seeing this on one server, swapping hardware seems like a reasonable thing to try. -Ben > Mvh / Kind regards > > Kristoffer Lippert > Systemansvarlig > JP/Politiken A/S > Online Magasiner > > Tlf. +45 8738 3032 > Cell. +45 6062 8703 > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From manjusc13 at rediffmail.com Fri Jul 6 09:58:44 2007 From: manjusc13 at rediffmail.com (manjunath c shanubog) Date: 6 Jul 2007 09:58:44 -0000 Subject: [Linux-cluster] Mysql installation on Cluster Message-ID: <1183621964.S.4215.12946.webmail51.rediffmail.com.old.1183715924.5775@webmail.rediffmail.com> Hi Mehmet,       Thanx for ur reply !           If it is possible i will go for active-active, otherwise active-passive.         Can u send me the steps to follow the cluster installation with fencing device and Mysql cluster installtion on the cluster.         Which fencing device is better ?Thanking YOuManjunath SC            On Thu, 05 Jul 2007 07:20:07 +0000 linux clustering wrotehii manjunath,how will you work the mysql cluster ? I know two way for the mysql-cluster.1. active-passive (failover)2. active-activeyou don\'t active-active, because it\'s not be with RHCS. 
for this, you have to visit mysql.com>From: \"manjunath c shanubog\" >Reply-To: linux clustering >To: >Subject: [Linux-cluster] Mysql installation on Cluster>Date: 4 Jul 2! 007 11:35:31 -0000>>Hi,      I need complete installation guide for >installing cluster using redhat EL 5, and Mysql installtion guide on >the cluster.      which fencing device >is better whether APC 9120 or NPS 230 from Western telematic.Thanking >YouManjunath     >-->Linux-cluster mailing list>Linux-cluster at redhat.com>https://www.redhat.com/mailman/listinfo/linux-cluster_________________________________________________________________http://liveearth.msn.com--Linux-cluster mailing listLinux-cluster at redhat.comhttps://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From suvankar_moitra at yahoo.com Fri Jul 6 09:32:40 2007 From: suvankar_moitra at yahoo.com (SUVANKAR MOITRA) Date: Fri, 6 Jul 2007 02:32:40 -0700 (PDT) Subject: SV: [linux-cluster] multipath issue... Smells of hardware issue. In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7A55@exchsrv07.rootdom.dk> Message-ID: <328099.69094.qm@web52002.mail.re2.yahoo.com> hi , Pl install the device driver in failover mode. regards Suvankar --- Kristoffer Lippert wrote: > Hi, > > Thank you very much for the explaination. > > The hardware should under no circumstances take 5 > minutes to perform a readsector. Not even when the > command queue is very long. > I've tried copying files to and from the SAN, and > i've tried a little program called sys_basher > working the disks continously since last Friday. > (almost a week) and i have not been able to > reproduce the error. Before i could produce it > within an hour by copying files. > I've only seen the error on one server, and i've > changed nothing. (well, obvouisly something must > have changed since the error seems to be gone.) > > I get a throughput of about 120mb/sec on the san > using GFS1. It's fast enough for my use (wich is > large files for a website). Is it far below expected > throughput? > > Kind regards > Kristoffer > > > > > -----Oprindelig meddelelse----- > Fra: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] P? vegne > af Benjamin Marzinski > Sendt: 5. juli 2007 22:01 > Til: linux clustering > Emne: Re: [linux-cluster] multipath issue... Smells > of hardware issue. > > On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer > Lippert wrote: > > Hi, > > > > I have a setup with two identical RX200s3 FuSi > servers talking to a SAN > > (SX60 + extra controller), and that works fine > with gfs1. > > > > I do however see some errors on one of the > servers. It's in my message log > > and only now and then now and then (though > always under load, but i cant > > load it and thereby force it to give the > error). > > > > The error says: > > Jun 28 15:44:17 app02 multipathd: 8:16: mark as > failed > > Jun 28 15:44:17 app02 multipathd: > main_disk_volume1: remaining active > > paths: 1 > > Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI > error: return code = > > 0x00070000 > > Jun 28 15:44:17 app02 kernel: end_request: I/O > error, dev sdb, sector > > 705160231 > > Jun 28 15:44:17 app02 kernel: device-mapper: > multipath: Failing path 8:16. 
> > Jun 28 15:44:22 app02 multipathd: sdb: > readsector0 checker reports path is > > up > > Jun 28 15:44:22 app02 multipathd: 8:16: > reinstated > > Jun 28 15:44:22 app02 multipathd: > main_disk_volume1: remaining active > > paths: 2 > > Jun 28 15:46:02 app02 multipathd: 8:32: mark as > failed > > Jun 28 15:46:02 app02 multipathd: > main_disk_volume1: remaining active > > paths: 1 > > Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI > error: return code = > > 0x00070000 > > Jun 28 15:46:02 app02 kernel: end_request: I/O > error, dev sdc, sector > > 739870727 > > Jun 28 15:46:02 app02 kernel: device-mapper: > multipath: Failing path 8:32. > > Jun 28 15:46:06 app02 multipathd: sdc: > readsector0 checker reports path is > > up > > Jun 28 15:46:06 app02 multipathd: 8:32: > reinstated > > Jun 28 15:46:06 app02 multipathd: > main_disk_volume1: remaining active > > paths: 2 > > > > To me i looks like a fiber that bounces up and > down. (There is no switch > > involved). > > > > Sometimes i only get a slightly shorter > version: > > Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI > error: return code = > > 0x00070000 > > Jun 29 09:04:32 app02 kernel: end_request: I/O > error, dev sdb, sector > > 2782490295 > > Jun 29 09:04:32 app02 kernel: device-mapper: > multipath: Failing path 8:16. > > Jun 29 09:04:32 app02 multipathd: 8:16: mark as > failed > > Jun 29 09:04:32 app02 multipathd: > main_disk_volume1: remaining active > > paths: 1 > > Jun 29 09:04:37 app02 multipathd: sdb: > readsector0 checker reports path is > > up > > Jun 29 09:04:37 app02 multipathd: 8:16: > reinstated > > Jun 29 09:04:37 app02 multipathd: > main_disk_volume1: remaining active > > paths: 2 > > > > Any sugestions, but start swapping hardware? > > It's possible that your scsi device is timing out > the scsi read command from the readsector0 path > checker, which is what it appears that your setup is > using to check the path status. This checker has > it's timeout set to 5 minutes, but I suppose that it > is possible to take this long if your hardware is a > flaky. If you're willing to recompile the code, you > can change this default by changing DEF_TIMEOUT in > libcheckers/checkers.h. DEF_TIMEOUT is the scsi > command timeout in milliseconds. > > Otherwise, if you are only seeing this on one > server, swapping hardware seems like a reasonable > thing to try. > > -Ben > > > Mvh / Kind regards > > > > Kristoffer Lippert > > Systemansvarlig > > JP/Politiken A/S > > Online Magasiner > > > > Tlf. +45 8738 3032 > > Cell. +45 6062 8703 > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ____________________________________________________________________________________ Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center. http://autos.yahoo.com/green_center/ From dan.deshayes at algitech.com Fri Jul 6 14:36:15 2007 From: dan.deshayes at algitech.com (Dan Deshayes) Date: Fri, 06 Jul 2007 16:36:15 +0200 Subject: [Linux-cluster] IP Relocate Error / IP Restart error In-Reply-To: References: Message-ID: <468E535F.2080606@algitech.com> Hello, I'm bumping this question since I'm experienceing a smiliar problem. 
When one of my services fails and the cluster tries to restart it, the node withdraws the IP and the route. It seems it can't set up the IP again once it has been withdrawn. It can fail over between nodes that hold other IP numbers, but never back, except when I manually put the IP and route back. I don't want to relocate the service just because sms-pixie fails, only to restart it (it stops when it loses its connection to a server). I'm using bonding, and my configuration looks like this: