[Linux-cluster] Re: rgmanager dieing with no messages [was: Re: SMP and GFS]
Eric Kerin
eric at bootseg.com
Wed Dec 14 19:31:11 UTC 2005
On Wed, 2005-10-05 at 17:08 -0400, Lon Hohberger wrote:
> On Mon, 2005-10-03 at 11:23 -0400, Eric Kerin wrote:
> > On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote:
> > > My cluster is highly unstable; just this morning I realized that
> > > the clurgmgrd daemon was dead...
> >
> > I'm having this same problem on my cluster. I've been planning to
> > enable core dumps for rgmanager once I find a few minutes to restart
> > the cluster services. With any luck, that will be today.
>
> If you see anything, let me know. There's a segfault I'm trying to
> track down which this is... I haven't been able to reproduce it
> internally :(
>
I finally got the downtime to enable core dumps, and just noticed that
rgmanager crashed (rather than hanging in the segfault loop). The more
I look at this, the stranger the problem seems.
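Enabling core dumps for a daemon generally means raising its RLIMIT_CORE soft limit before it starts. The limit is inherited by children created with fork() and kept across exec(), which is one reason the cluster services have to be restarted before a crash can leave a core file. The C sketch below shows the underlying POSIX calls; the helpers enable_core_dumps() and print_core_limit() are hypothetical illustrations, not part of rgmanager. From a shell, the usual equivalent is `ulimit -c unlimited` in the environment that launches clurgmgrd. On Linux, where the core file lands and how it is named is controlled separately by /proc/sys/kernel/core_pattern.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of rgmanager): raise this process's soft
 * core-file size limit up to its hard limit. Raising the soft limit as far
 * as the hard limit needs no extra privilege, and the new value is
 * inherited by fork()ed children and preserved across exec().
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }

    /* Becomes RLIM_INFINITY if the hard limit is already unlimited. */
    rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: print the current soft core-file limit in bytes. */
void print_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("core file size: unlimited\n");
    else
        printf("core file size: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
}
```

A daemon would call enable_core_dumps() early in main(), before spawning worker threads or child processes, and could call print_core_limit() to log the result. The sketch raises the soft limit only to the hard limit rather than forcing RLIM_INFINITY, so it succeeds even without root or CAP_SYS_RESOURCE; if the hard limit is 0, core dumps stay disabled and the hard limit has to be raised by a privileged parent first.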
I don't have any NFS exports in my cluster.conf file, so I don't think
that bug applies. But I am seeing really strange data in the backtraces
(below), similar to what is reported in
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166109
The thing is, this is a stock RHEL4 U1 kernel (2.6.9-11.ELsmp), a 32-bit
kernel running on 64-bit-capable Xeon processors.
I can compress the core dump and send it if you like, or run any gdb
(or similar) commands you need.
Thanks,
Eric
[root at auhjpsn01a ~]# gdb /usr/sbin/clurgmgrd
GNU gdb Red Hat Linux (6.3.0.0-0.31rh)
<SNIP LICENSE+STUFF>
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
(gdb) core /core.2707
Core was generated by `clurgmgrd'.
Program terminated with signal 11, Segmentation fault.
#0 0x006bb5e9 in ?? ()
(gdb) thr a a bt
Thread 4 (process 2707):
#0 0x006427a2 in ?? ()
Cannot access memory at address 0xbff3dbcc
Thread 3 (process 3917):
#0 0x006427a2 in ?? ()
Cannot access memory at address 0xb75e4318
Thread 2 (process 10987):
#0 0x006427a2 in ?? ()
Cannot access memory at address 0xb4bff28c
Thread 1 (process 10986):
#0 0x006bb5e9 in ?? ()
#1 0x00000000 in ?? ()