[Linux-cluster] Cluster Crashes
Ryan O'Hara
rohara at redhat.com
Mon Nov 20 19:47:53 UTC 2006
isplist at logicore.net wrote:
>
> First of all, is there a way I can test to see if my Brocade switch is
> actually doing any fencing or not? I get the sense it's doing nothing.
>
> I think this because my cluster is terribly unstable. If I reboot a node,
> that's fine, it works, the cluster stays up. However, if one of the nodes
> crashes in any manner, it takes down everything, to the point that I have to
> shut down every machine and start them back up one at a time.
>
> If a drive gets moved on my FC storage, the cluster crashes. If the storage
> is rebooted, the cluster crashes. If I change pretty much anything on the
> storage, the cluster crashes, it's nuts. The way it seems to start is that one
> node seems to have a kernel panic which sets off the rest.
>
> I know this is limited information but I need somewhere to start. I can't even
> begin to think of using this in a production environment, no one would get any
> sleep watching over this to make sure it's all up :).
>
> Mike

This sounds a lot like the RSCN problem I tried to chase down a while
back. In a nutshell, something changes on the SAN and an RSCN (Registered
State Change Notification) event occurs, which is seen by all nodes on the
SAN. The RSCN event should be completely harmless, but I have seen it kill
all the FC I/O paths, which cuts the nodes off from the shared storage. I
would expect the cluster itself to stay up, but nodes would withdraw from
the filesystem as soon as they lost the I/O path.

Are you using QLogic HBAs? If so, check /var/log/messages for any "SCSI
errors".
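
A rough way to do that check is sketched below in Python. The pattern list is an assumption on my part, not something taken from the driver: exact message wording differs between kernel and driver versions, so "scsi error", "qla2" (a guess at the QLogic driver name prefix) and "rscn" are only starting points to adjust. The function names are made up for illustration.

```python
import re
from typing import Iterable, List

# Assumed search patterns; adjust to whatever your kernel/driver actually logs.
PATTERNS = [
    re.compile(r"scsi error", re.IGNORECASE),  # generic SCSI error reports
    re.compile(r"qla2", re.IGNORECASE),        # assumed QLogic driver prefix
    re.compile(r"rscn", re.IGNORECASE),        # any mention of RSCN events
]


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match at least one pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in PATTERNS)
    ]


def scan_log(path: str = "/var/log/messages") -> List[str]:
    """Read a syslog file and return the suspect lines from it."""
    # errors="replace" keeps odd bytes in the log from aborting the scan.
    with open(path, errors="replace") as fh:
        return find_suspect_lines(fh)
```

Calling scan_log() as root on each node and comparing the timestamps of the hits against the time of the storage change should show whether the I/O paths dropped at the same moment on every node.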

What you are seeing could be unrelated, but the symptoms sound roughly
the same.

Ryan