[Linux-cluster] Cluster Crashes

Ryan O'Hara rohara at redhat.com
Mon Nov 20 19:47:53 UTC 2006


isplist at logicore.net wrote:
>
> First of all, is there a way I can test to see if my Brocade switch is 
> actually doing any fencing or not? I get the sense it's doing nothing.
> 
> I think this because my cluster is terribly unstable. If I reboot a node,
> that's fine, it works, the cluster stays up. However, if one of the nodes
> crashes in any manner, it takes down everything, to the point that I have to
> shut down every machine and start them back up one at a time.
> 
> If a drive gets moved on my FC storage, the cluster crashes. If the storage
> is rebooted, the cluster crashes. If I change pretty much anything on the
> storage, the cluster crashes; it's nuts. The way it seems to start is that one
> node has a kernel panic, which sets off the rest.
> 
> I know this is limited information, but I need somewhere to start. I can't even
> begin to think of using this in a production environment; no one would get any
> sleep watching over this to make sure it's all up :).
> 
> Mike

This sounds a lot like the RSCN (Registered State Change Notification)
problem I tried to chase down a while back. In a nutshell, something
changes on the SAN and an RSCN event occurs, which is seen by all nodes
on the SAN. The RSCN event should be completely harmless, but I have
seen it kill all the FC I/O paths on a node. Even then, I would expect
the cluster itself to stay up, with nodes withdrawing from the
filesystem as soon as they lost the I/O path.

Are you using Qlogic HBAs? If so, check /var/log/messages for any "SCSI 
errors".
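
If it helps, here is a rough sketch of that log check in Python. The
search patterns are my assumptions, not exact driver output: the wording
of SCSI and Qlogic (qla2xxx) messages varies with kernel and driver
version, so adjust them to whatever actually shows up on your nodes. The
function names are made up for illustration.

```python
import re

# Assumed patterns: generic "SCSI error" text, anything logged by a
# qla2* driver, and any mention of RSCN. These are guesses at what is
# worth looking at, not a list of known message formats.
PATTERNS = re.compile(r"scsi error|qla2|rscn", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching PATTERNS."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERNS.search(line)
    ]


def scan_log(path="/var/log/messages"):
    """Scan a syslog file and return the suspect lines found in it."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Running scan_log() on each node (reading /var/log/messages usually needs
root) and comparing the timestamps of the hits against the time of the
storage change should show whether the errors line up with an RSCN.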

What you are seeing could be unrelated, but the symptoms sound roughly
the same.

Ryan



