[Linux-cluster] Re: rgmanager dieing with no messages [was: Re: SMP and GFS]

Eric Kerin eric at bootseg.com
Wed Dec 14 19:31:11 UTC 2005


On Wed, 2005-10-05 at 17:08 -0400, Lon Hohberger wrote:
> On Mon, 2005-10-03 at 11:23 -0400, Eric Kerin wrote:
> > On Sun, 2005-10-02 at 11:06 -0400, DeadManMoving wrote:
> > > My cluster is highly unstable; just this morning I realized that
> > > the clurgmgrd daemon was dead...
> > 
> > I'm having this same problem on my cluster. I've been planning on
> > enabling core dumps for rgmanager once I find a few minutes to restart
> > the cluster services. With any luck, that will be today.
> 
> If you see anything, let me know.  There's a segfault I'm trying to
> track down which this is... I haven't been able to reproduce it
> internally :(
> 

I finally got the downtime to enable core dumps, and just noticed that
rgmanager crashed (not hung in the segfault loop). After looking at it
a bit, the problem seems quite strange to me.

I don't have any NFS exports in my cluster.conf file, so I don't think
that bug applies. But I am seeing really strange data in the backtraces
(below), similar to
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166109

The thing is, this is a stock RHEL4 U1 kernel (2.6.9-11.ELsmp) on
64-bit capable Xeon processors, but running a 32-bit kernel.

I can compress the core dump I have and send it, if you like, or run any
gdb commands (or similar) that you need.


Thanks, 
Eric




[root@auhjpsn01a ~]# gdb /usr/sbin/clurgmgrd
GNU gdb Red Hat Linux (6.3.0.0-0.31rh)
<SNIP LICENSE+STUFF>
This GDB was configured as "i386-redhat-linux-gnu"...Using host
libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) core /core.2707
Core was generated by `clurgmgrd'.
Program terminated with signal 11, Segmentation fault.
#0  0x006bb5e9 in ?? ()

(gdb) thr a a bt
Thread 4 (process 2707):
#0  0x006427a2 in ?? ()
Cannot access memory at address 0xbff3dbcc

Thread 3 (process 3917):
#0  0x006427a2 in ?? ()
Cannot access memory at address 0xb75e4318

Thread 2 (process 10987):
#0  0x006427a2 in ?? ()
Cannot access memory at address 0xb4bff28c

Thread 1 (process 10986):
#0  0x006bb5e9 in ?? ()
#1  0x00000000 in ?? ()
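
For comparing dumps like the one above, a small script can tally how many
frames gdb failed to resolve in each thread. This is only a sketch: the
function name summarize_backtrace and the regular expressions are
assumptions made for illustration, not part of gdb or rgmanager, and they
only cover the output layout shown in this message.

```python
import re

# Thread header as printed by "thread apply all bt",
# e.g. "Thread 4 (process 2707):"
THREAD_RE = re.compile(r"^Thread (\d+) \(.*\):\s*$")
# Frame line, e.g. "#0  0x006427a2 in ?? ()"; the address part is optional
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def summarize_backtrace(text):
    """Count frames, unresolved frames ("??") and unwind failures per thread."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            summary[current] = {"frames": 0, "unresolved": 0, "truncated": False}
            continue
        if current is None:
            continue  # ignore anything printed before the first thread header
        frame = FRAME_RE.match(line)
        if frame:
            summary[current]["frames"] += 1
            if frame.group(2) == "??":
                summary[current]["unresolved"] += 1
        elif line.startswith("Cannot access memory"):
            # gdb gave up unwinding this thread's stack
            summary[current]["truncated"] = True
    return summary


if __name__ == "__main__":
    sample = """\
Thread 4 (process 2707):
#0  0x006427a2 in ?? ()
Cannot access memory at address 0xbff3dbcc

Thread 1 (process 10986):
#0  0x006bb5e9 in ?? ()
#1  0x00000000 in ?? ()
"""
    for thread_id, info in sorted(summarize_backtrace(sample).items()):
        print(thread_id, info)
```

If every frame in every thread is counted as unresolved, as it would be for
the session above, that suggests gdb could not map the addresses to symbols
at all, rather than one thread being corrupted.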




