[Linux-cluster] Freeze with cluster-2.03.11
Kadlecsik Jozsef
kadlec at mail.kfki.hu
Fri Mar 27 06:47:06 UTC 2009
On Fri, 27 Mar 2009, Ben Yarwood wrote:
> Replaying a journal as below usually indicates a node has withdrawn from that
> file system I believe. You should grep messages on all nodes for 'GFS'; if
> any node is reporting errors with this fs then it will need rebooting/fencing
> before access to that fs can be achieved.
The failing node is fenced off. Here are the steps to reproduce the
freeze of the node:
- all nodes are running and are members of the cluster
- start the mailman queue manager: the node freezes
- the frozen node is fenced off by a member of the cluster
- I can see log messages as I wrote in my first mail:
Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to
acquire journal lock...
[...]
- sometimes (but not always) the fencing machine freezes as well
and is then fenced off in turn
- the third node has never frozen so far, so the cluster has remained
in quorum
- the fenced-off machines restart, rejoin the cluster and work until I start
the mailman queue manager
The daily backups of the whole GFS file systems complete successfully, so I
assume it's not filesystem corruption.
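For anyone who wants to repeat the log check suggested in the quoted text on
each node, here is a minimal sketch in Python. It only assumes that the kernel
messages look like the two sample lines above; the pattern, the function name
find_cluster_messages and the third (unrelated) sample line are illustrative
assumptions, not part of any cluster tool.

```python
import re

# Matches kernel messages from the GFS and dlm subsystems in syslog format.
# The pattern is an assumption derived from the two log lines quoted above.
PATTERN = re.compile(r"kernel: (GFS|dlm): (.*)")


def find_cluster_messages(lines):
    """Return (subsystem, message) tuples for GFS/dlm kernel log lines."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


sample = [
    "Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1",
    "Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: "
    "Trying to acquire journal lock...",
    # Hypothetical unrelated line, included only to show it is skipped.
    "Mar 26 23:09:26 lxserv1 sshd[1234]: session opened",
]

for subsystem, message in find_cluster_messages(sample):
    print(subsystem, "|", message)
```

On a real node the same function could be fed the lines of the system log
(often /var/log/messages, though the path depends on the syslog setup), and
the output compared across all nodes.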
Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary