[Linux-cluster] lock_gulm heartbeat
Michael Conrad Tadpol Tilstra
mtilstra at redhat.com
Thu Feb 17 16:20:44 UTC 2005
Edward Mann wrote:
> Everyone,
>
>
> I have a GFS cluster that has been running fine for about 3 months now.
> I am only using 2 machines, and the storage is a FireWire drive. Over the
> weekend I started to get:
> lock_gulmd_core[1100]: Failed to receive a timely heartbeat reply from
> Master. (t:1108583425506998 mb:1)
>
> and after 2, which is what I allowed for missed heartbeats, the GFS
> slave would die. I moved the allowed missed heartbeats up to 5 and have
> seen it miss as many as 4 in a row. The only thing that has changed on the
> machine is that I added new clients to process files once they are placed
> on the machine. I am using FAM to notify my app that a new file is
> present.
>
> Any ideas on what I should look at? How can I diagnose this problem? The
> communication between the two machines seems fine. I can ping both
> hosts. I am really at a loss as to what to look for.
It sounds like it might be a load problem. The node is busy doing other
work and isn't giving the gulm core process enough time to do its
heartbeats. You should try increasing the heartbeat_rate some. And
maybe nice the lock_gulmd processes to a higher priority.

If you want to try tracking all of the heartbeat messages, set the
'heartbeat' verbosity flag.
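
To see whether the misses line up with busy periods, something like the
following rough sketch could pull the missed-heartbeat lines out of syslog.
It is modeled only on the one log line quoted above, so it makes some
assumptions: that `t:` is microseconds since the Unix epoch, and that `mb:`
is the running count of missed beats. The function names are made up for
illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern modeled on the single log line quoted in this thread.
MISS_RE = re.compile(
    r"lock_gulmd_core\[\d+\]: Failed to receive a timely heartbeat reply "
    r"from\s+Master\. \(t:(\d+) mb:(\d+)\)"
)


def parse_misses(lines):
    """Return a list of (utc_time, mb) for each missed-heartbeat line."""
    misses = []
    for line in lines:
        m = MISS_RE.search(line)
        if m is None:
            continue
        # Assumption: t is microseconds since the Unix epoch.
        secs, micros = divmod(int(m.group(1)), 1_000_000)
        when = datetime.fromtimestamp(secs, tz=timezone.utc)
        when += timedelta(microseconds=micros)
        # Assumption: mb is the count of consecutive missed beats so far.
        misses.append((when, int(m.group(2))))
    return misses


def worst_streak(misses):
    """Largest mb value seen, or 0 if there were no misses."""
    return max((mb for _, mb in misses), default=0)
```

Feed it the lines of the log file, for example
`parse_misses(open("/var/log/messages"))`, and compare the timestamps
against when the file-processing clients were busy. The log path is only
an example; use wherever syslog writes on your nodes.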
--
Michael Conrad Tadpol Tilstra
AH! Get off my leg! You're Not my type!!!!