[Linux-cluster] Cluster Suite v3 software watchdog
Steven Dake
sdake at mvista.com
Tue Jan 3 19:01:39 UTC 2006
It depends on the architecture and on whether the standard kernel.org
kernel has been modified. The kernel.org kernel does the following:
x86_64 has nmi_watchdog default to off
i386 has nmi_watchdog default to on
No other architectures have an NMI watchdog that I am aware of.
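To see whether the parameter was actually passed at boot, something
along these lines could be used. This is only a minimal sketch: the
function name nmi_watchdog_setting is made up for illustration, and it
assumes that when the parameter appears more than once, the last
occurrence is the one that matters.

```python
from typing import Optional


def nmi_watchdog_setting(cmdline: str) -> Optional[str]:
    """Return the value of nmi_watchdog= in a kernel command line, or None.

    Hypothetical helper for illustration only. Assumption: if the
    parameter is repeated, the last occurrence is the one reported.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
    return value


# Example with a made-up command line; on a live system the string
# could be read from /proc/cmdline instead.
sample = "ro root=LABEL=/ nmi_watchdog=1"
print(nmi_watchdog_setting(sample))  # prints 1
```

A result of None only means the parameter was not given on the command
line, in which case the kernel's built-in default for that architecture
applies.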
The NMI watchdog simply prints a backtrace when interrupts have been off
for too long. This happens when a buggy driver or other kernel code
disables interrupts on a processor and doesn't re-enable them. The NMI
watchdog is then not fed, and it triggers a stack backtrace (instead of
a total lockup), which allows someone experienced in kernel development
to find the source of the offending lock and fix the kernel code.
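The "feeding" idea can be shown with a toy model. This is not the
kernel's implementation, just a sketch under simple assumptions: a
counter stands in for interrupts being serviced, a periodic check stands
in for the NMI, and the class name, method names and threshold are all
invented for the example.

```python
class WatchdogSim:
    """Toy model of a watchdog that fires when a counter stops advancing."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # stalled checks tolerated (made-up value)
        self.ticks = 0              # stands in for serviced interrupts
        self.last_seen = 0          # counter value at the previous check
        self.stalled_checks = 0     # consecutive checks with no progress

    def timer_interrupt(self) -> None:
        """Normal interrupt handling is running; this feeds the watchdog."""
        self.ticks += 1

    def nmi_check(self) -> bool:
        """Periodic check that still runs while interrupts are off.

        Returns True when the watchdog fires.
        """
        if self.ticks != self.last_seen:
            # Progress was made since the last check, so reset.
            self.last_seen = self.ticks
            self.stalled_checks = 0
            return False
        self.stalled_checks += 1
        return self.stalled_checks >= self.threshold


wd = WatchdogSim(threshold=3)
wd.timer_interrupt()            # interrupts still being serviced
print(wd.nmi_check())           # prints False
# From here on nothing calls timer_interrupt(), as if interrupts
# had been disabled and never re-enabled.
print([wd.nmi_check() for _ in range(3)])  # prints [False, False, True]
```

In the real case, the point at which the check returns True is where
the backtrace gets printed.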
If you are using a commercial vendor kernel with the drivers it
supplies, I really doubt you will encounter this sort of failure; this
feature is generally used during development of kernel code.
Some vendor kernels do special things when the NMI watchdog fires, such
as taking a system memory dump and then rebooting, so that the vendor
can debug the crash at a later time.
Regards
-steve
On Wed, 2005-12-21 at 16:50 -0200, Celso K. Webber wrote:
> Hi Lon,
>
> Thank you very much for your reply. I'll try your tips.
>
> Now another question: is it really necessary to pass the
> "nmi_watchdog=1" parameter to the kernel? Or is it enabled by default
> under RHELv3 or v4?
>
> Regards,
>
> Celso.
>
> Lon Hohberger wrote:
>
> >On Wed, 2005-12-21 at 16:25 -0200, Celso K. Webber wrote:
> >
> >
> >
> >>Has anyone had this issue before? Or am I missing a step in
> >>configuring the software watchdog feature?
> >>
> >>Another question for the Red Hat people on the list: does this "software
> >>watchdog" work OK? I ask because it's enabled by default when you add a
> >>new member to the cluster. The Cluster Suite v3 manual says nothing
> >>about this resource either.
> >>
> >>
> >
> >Yes, it works fine.
> >
> >A few things could be happening:
> >
> >(1) The NMI watchdog will reboot the machine if it detects an NMI hang.
> >This is only a few seconds.
> >
> >(2) The cluster is extremely paranoid because you are not using a
> >STONITH device (power controller), and it's detecting internal hangs.
> >Try increasing the failover time.
> >
> >(3) The cluster is not getting scheduled due to system load. See the
> >man page for cludb(8) about clumembd%rtp - both may help.
> >
> >
> >-- Lon
> >
> >
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster