[Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11 nodes?

Wed Mar 21 00:53:19 UTC 2007

On Tue, Mar 20, 2007 at 08:36:59PM -0400, rhurst at bidmc.harvard.edu wrote:
> I ran a series of reboots, and this problem is totally reproducible.  Should I be opening a ticket at Red Hat Support on this?
> 
> The problem is immediate with 'service rgmanager stop', as it hangs in its sleep loop forever, even though all nodes in the cluster report that it changed its state to down.  But worse than that, it also hangs all GFS I/O and the load average on all nodes starts to spike (>9.00) -- I see gfs_scand in top racing away.
> 
> It only gets fixed when I manually 'power reset' the node, then I get the 'Missed too many heartbeats' followed by fencing.  Help.
> 
>

echo "RGMGR_OPTS=-d" > /etc/sysconfig/cluster

and reproduce, then open a ticket with support.  It's possible that rgmanager is
waiting for one of your service scripts to stop and the script is not returning.
Also, there was a bug where bash would segfault and rgmanager would just hang.
Make sure you have the newest version of bash and see if the problem still
reproduces.  If none of the above helps, definitely file a support ticket.  If
frontline cannot figure it out, it will probably make it back to me and I'll take
a look.
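
One caveat on the echo command above: with '>' it overwrites
/etc/sysconfig/cluster, so any other settings already in that file are lost.
A minimal sketch of a non-destructive variant follows, assuming the file is a
plain list of shell-style NAME=value lines. The helper name set_rgmgr_opts is
hypothetical and is not part of rgmanager.

```shell
# Hypothetical helper (not part of rgmanager): set RGMGR_OPTS in a
# sysconfig-style file while keeping every other line in the file.
# Usage: set_rgmgr_opts FILE OPTS
set_rgmgr_opts() {
    file=$1
    opts=$2
    tmp="$file.tmp.$$"

    # Make sure the file exists so grep has something to read.
    touch "$file" || return 1

    # Copy every line except an existing RGMGR_OPTS assignment.
    # grep exits non-zero when it selects no lines, which is fine here.
    grep -v '^RGMGR_OPTS=' "$file" > "$tmp" || true

    # Append the new assignment, quoted so options with spaces survive.
    printf 'RGMGR_OPTS="%s"\n' "$opts" >> "$tmp" || return 1

    # Replace the original file with the updated copy.
    mv "$tmp" "$file"
}

# Example (run as root on the node being debugged):
#   set_rgmgr_opts /etc/sysconfig/cluster -d
```

If the file holds nothing else, the plain echo above has the same effect.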

Josef