[Linux-cluster] Cluster service restarting Locally

Tue Mar 14 03:58:41 UTC 2006

Dear Mr. Hohberger  

Thanks for the reply.

When running only one node, the service restarts much less often, but it still happens, and with the same symptoms.

The machines are HP DL380G3 (2 node) with MSA SAN 1000 storage.
The load average is around 4.
The cluster is primarily for running a PostgreSQL database of around 116 GB.

Saju John

On Mon, 13 Mar 2006 Lon Hohberger wrote:
>On Sat, 2006-03-11 at 10:50 +0000, saju john wrote:
> >
> > Dear Mr. Hohberger,
> >
> > Thanks for the reply.
> >
> > I saw your comments on the problem I reported, i.e. that lock traffic
> > is getting network-starved.
>
>It could be getting I/O starved too, which might explain more given that
>this seems to happen on one node.  When running just one node and the
>service restarts, are the symptoms the same?  Does it report these kinds
>of errors, or are they different?
>
>[quote from your previous mail]
>clusvcmgrd[1388]: <err> Unable to obtain cluster lock: Connection
>timed out
>clulockd[1378]: <warning> Denied A.B.C.D: Broken pipe
>clulockd[1378]: <err> select error: Broken pipe
>[/quote]
>
>If they're different in the one-node case, what are the errors?  Also,
>are there any other errors in the logs?
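
One way to compare the one-node and two-node cases is to tally the daemon messages in each log. This is only a minimal sketch: the regular expression assumes the `daemon[pid]: <level> message` shape of the lines quoted above, the sample uses those three messages with the mail line wrap rejoined, and `tally` is a made-up helper name. Real syslog lines carry a timestamp and hostname in front, which `search` simply skips over.

```python
import re
from collections import Counter

# Assumed line shape, taken from the quoted excerpts: daemon[pid]: <level> message
LOG_LINE = re.compile(r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: <(?P<level>\w+)> (?P<message>.*)")


def tally(lines):
    """Count how often each (daemon, level, message) triple appears."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            key = (match.group("daemon"), match.group("level"),
                   match.group("message").strip())
            counts[key] += 1
    return counts


# The three messages quoted in this thread, one per line.
sample = [
    "clusvcmgrd[1388]: <err> Unable to obtain cluster lock: Connection timed out",
    "clulockd[1378]: <warning> Denied A.B.C.D: Broken pipe",
    "clulockd[1378]: <err> select error: Broken pipe",
]

for (daemon, level, message), count in tally(sample).most_common():
    print(count, daemon, level, message)
```

Running it once over the log from a one-node period and once over a two-node period would show whether the same messages turn up in both.
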
>
>
> > My assumption is that the problem is due to some corruption of the
> > metadata written to the quorum partition, as both nodes write to
> > the quorum partition concurrently.
>
>I really doubt that.  In the case of lock information, only one node
>writes at a time anyway...
>
> > It may be due to a bug in the raw device driver; I am not sure. The
> > interesting question, then, is how the cluster worked all this time
> > (around one year for me, without any major problem).
>
>The odds of random, block-level corruption going undetected when reading
>from the raw partitions are low - between (2^32):1 and (2^96):1 against
>per block, based on internal consistency checks that clumanager
>performs.  My math might be a little off, but it requires two randomly
>correct 32-bit magic numbers and one randomly valid 32-bit CRC, with
>other data incorrect to cause a problem.
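
The two ends of that range can be worked out directly. The sketch below assumes that each 32-bit check (two magic numbers and one CRC, as described above) is passed by random data with probability 1 in 2^32, and that the three checks are independent. Both are simplifying assumptions made here, not statements about how clumanager is implemented, and `chance_random_pass` is a made-up name.

```python
from fractions import Fraction

BITS_PER_CHECK = 32  # width of each magic number and of the CRC, per the message
NUM_CHECKS = 3       # two magic numbers plus one CRC


def chance_random_pass(bits, checks):
    """Probability that uniformly random data passes `checks` independent
    comparisons against fixed `bits`-wide values (assumed model)."""
    return Fraction(1, 2 ** (bits * checks))


one_check = chance_random_pass(BITS_PER_CHECK, 1)
all_checks = chance_random_pass(BITS_PER_CHECK, NUM_CHECKS)

print("one check passes by chance:  1 in", one_check.denominator)
print("all three pass by chance:    1 in", all_checks.denominator)
```

Under those assumptions the lower bound corresponds to a single check being fooled and the upper bound to all three being fooled at once.
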
>
>Specifically in the lock case, a lock block which passed all of the
>consistency checks but was *actually* corrupt would almost always cause
>clulockd to crash.
>
>Timeout errors mean that clulockd didn't respond to a request in a given
>amount of time, and can be caused by either network saturation or poor
>raw I/O performance to shared storage.  It looks like it's getting to an
>incoming request too late...
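
To get a rough feel for the storage side, per-read latency can be sampled with a small timing loop. This is a hedged sketch, not a proper benchmark: `time_reads` and `summarize` are made-up helpers, the 512-byte block size is an assumption, and the demo reads a scratch file, which mostly measures the page cache. Whether it runs unchanged against a raw device binding is untested here, since raw devices generally require aligned I/O.

```python
import os
import tempfile
import time


def time_reads(path, block_size=512, count=100):
    """Return the wall-clock latency in seconds of `count` sequential reads."""
    latencies = []
    with open(path, "rb", buffering=0) as handle:  # unbuffered at the Python level
        for _ in range(count):
            start = time.perf_counter()
            data = handle.read(block_size)
            latencies.append(time.perf_counter() - start)
            if len(data) < block_size:
                handle.seek(0)  # reached the end: wrap around to the start
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to min, median, and max."""
    ordered = sorted(latencies)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }


# Demo against a 64 KiB scratch file; replace the path with the target to test.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\0" * 64 * 1024)
    scratch_path = scratch.name

latencies = time_reads(scratch_path)
stats = summarize(latencies)
os.unlink(scratch_path)

print({name: f"{value * 1e6:.1f} us" for name, value in stats.items()})
```

A maximum far above the median while the database is busy would point toward I/O starvation rather than the network.
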
>
>
> > Could you please also consider this when releasing RHCS3U7?
>
>If this is a critical issue for you, then you should file a ticket with
>Red Hat Support if you have not already done so:
>
>    http://www.redhat.com/apps/support/
>
>If you think this is a bug, you can also file a Bugzilla, and we will
>get to it when we can:
>
>    http://bugzilla.redhat.com/bugzilla/
>
>-- Lon
>