[Linux-cluster] Cluster Suite v3 software watchdog

Tue Jan 3 19:01:39 UTC 2006

It depends on the architecture and on whether the vendor has modified
the stock kernel.org kernel.  Kernel.org does the following:
x86_64 has nmi_watchdog default to off
i386 has nmi_watchdog default to on

No other arches have an NMI watchdog that I am aware of.
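
To see what a particular boot actually asked for, you can look for the
parameter on the kernel command line.  The following is only a minimal
Python sketch and not part of any cluster tool: the function name
parse_nmi_watchdog is made up for illustration, and it assumes that when
the parameter appears more than once the last occurrence wins.  It
reports what was passed at boot, not the compiled-in default, which it
cannot see.

```python
from typing import Optional


def parse_nmi_watchdog(cmdline: str) -> Optional[str]:
    """Return the value of nmi_watchdog= from a kernel command line, or None.

    None means the parameter was not passed, so whatever default the
    kernel was built with for that architecture applies.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            # Assumption: a later occurrence overrides an earlier one.
            value = token.split("=", 1)[1]
    return value


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print(parse_nmi_watchdog(f.read()))
```

If this prints None, the parameter was not given and the behaviour is
whatever that kernel defaults to.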

The NMI watchdog simply prints out a backtrace when interrupts are off
for too long.  This happens when a buggy driver or other kernel code
disables interrupts on a processor and doesn't reenable them.  Hence,
the NMI watchdog is not fed, and it triggers a stack backtrace (instead
of a total lockup), which allows someone experienced in kernel
development to find the source of the offending lock and fix the kernel
code.
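
The feeding logic described above can be illustrated with a toy model.
This is a sketch in Python, not kernel code: the class and function
names are hypothetical, the threshold of 5 is an arbitrary value chosen
for the example, and each loop step stands in for one NMI period.  The
normal timer tick only happens while interrupts are enabled; the NMI
arrives regardless and checks whether any tick has happened since its
last visit.

```python
class ToyNmiWatchdog:
    """Toy model of one CPU's NMI watchdog (illustration only)."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold  # stalled NMIs tolerated (assumed value)
        self.ticks = 0              # timer interrupts seen so far
        self.last_seen = 0          # tick count observed at the previous NMI
        self.stalled = 0            # consecutive NMIs with no new ticks

    def timer_tick(self) -> None:
        """Normal timer interrupt; only runs while interrupts are enabled."""
        self.ticks += 1

    def nmi(self) -> bool:
        """Non-maskable check; returns True when the watchdog would fire."""
        if self.ticks != self.last_seen:
            # Ticks advanced since the last NMI, so the watchdog was fed.
            self.last_seen = self.ticks
            self.stalled = 0
            return False
        self.stalled += 1
        return self.stalled >= self.threshold


def run(irqs_enabled_pattern, threshold: int = 5):
    """Take one bool per step; return the step where the watchdog fires, or None."""
    dog = ToyNmiWatchdog(threshold)
    for step, irqs_enabled in enumerate(irqs_enabled_pattern):
        if irqs_enabled:
            dog.timer_tick()
        if dog.nmi():
            return step
    return None
```

For example, run([True] * 3 + [False] * 10) returns 7: interrupts stop
after step 2, and the fifth stalled check happens at step 7.  A pattern
that only disables interrupts briefly never reaches the threshold and
returns None.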

If you are using a commercial vendor kernel with its supplied drivers, I
really doubt you will encounter this sort of failure; this feature is
generally used during development of kernel code.

Some vendor kernels do special things when the NMI watchdog fires, like
taking a system memory dump and then rebooting, to allow the vendor to
debug the crash at a later time.

Regards
-steve

On Wed, 2005-12-21 at 16:50 -0200, Celso K. Webber wrote:
> Hi Lon,
> 
> Thank you very much for your reply. I'll try your tips.
> 
> Now another question: is it really necessary to pass the 
> "nmi_watchdog=1" parameter to the kernel? Or is it enabled by default 
> under RHEL v3 or v4?
> 
> Regards,
> 
> Celso.
> 
> Lon Hohberger wrote:
> 
> >On Wed, 2005-12-21 at 16:25 -0200, Celso K. Webber wrote:
> >
> >  
> >
> >>Has anyone had this issue before? Or am I missing a step in 
> >>configuring the software watchdog feature?
> >>
> >>Another question for the Red Hat people on the list: does this "software 
> >>watchdog" work OK? I ask because it's enabled by default when you add a 
> >>new member to the cluster. The Cluster Suite v3 manual says nothing 
> >>about this feature either.
> >>    
> >>
> >
> >Yes, it works fine.
> >
> >A few things could be happening:
> >
> >(1) The NMI watchdog will reboot the machine if it detects an NMI hang.
> >Its timeout is only a few seconds.
> >
> >(2) The cluster is extremely paranoid because you are not using a
> >STONITH device (power controller), and it's detecting internal hangs.
> >Try increasing the failover time.
> >
> >(3) The cluster is not getting scheduled due to system load.  See the
> >man page for cludb(8) about clumembd%rtp - both may help.
> >
> >
> >-- Lon
> >  
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster