[Linux-cluster] GFS Failure

Ben Yarwood ben.yarwood at juno.co.uk
Tue Nov 21 01:31:17 UTC 2006


I have a three-node cluster running the latest RHEL4U4. It had been
running without problems until this evening, when one of the GFS file
systems failed completely on one node, jrmedia-a.  Executing any command
resulted in an I/O error.  The resource manager noticed the failure but
could not relocate the corresponding service to another node.

Can anyone shed some light on what happened?  I have since unmounted and
then remounted the file system on that node, and stopped and restarted the
service, and everything seems to have returned to normal.

Here are the relevant log messages from all three nodes.  

jrmedia-a

Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: fatal: filesystem consistency error
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0:   RG = 18652911
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: function = gfs_setbit
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: file = /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/gfs/bits.c, line = 71
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: time = 1164066728
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: about to withdraw from the cluster
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: waiting for outstanding I/O
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: telling LM to withdraw
Nov 20 23:52:11 jrmedia-a kernel: lock_dlm: withdraw abandoned memory
Nov 20 23:52:11 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: withdrawn
Nov 20 23:52:47 jrmedia-a clurgmgrd[4938]: <notice> status on clusterfs "customersfs" returned 1 (generic error)
Nov 20 23:52:47 jrmedia-a clurgmgrd[4938]: <notice> Stopping service customers
Nov 20 23:52:48 jrmedia-a clurgmgrd: [4938]: <info> Removing IPv4 address 10.0.20.56 from eth1
Nov 20 23:52:58 jrmedia-a clurgmgrd: [4938]: <err> /mnt/customers is not a directory
Nov 20 23:52:58 jrmedia-a clurgmgrd[4938]: <notice> stop on nfsclient "read-write" returned 2 (invalid argument(s))
Nov 20 23:52:59 jrmedia-a clurgmgrd[4938]: <crit> #12: RG customers failed to stop; intervention required
Nov 20 23:52:59 jrmedia-a clurgmgrd[4938]: <notice> Service customers is failed
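
The "time = 1164066728" field in the jrmedia-a kernel messages looks like a
Unix epoch value. Assuming that reading is right (it is not confirmed
anywhere in the log itself), a minimal Python sketch converts it so it can be
compared with the syslog stamp. The function name epoch_to_utc is only
illustrative.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> str:
    """Format a Unix epoch value (in seconds) as a UTC timestamp string."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


# Value copied from the "time = ..." line in the jrmedia-a log.
# Assumption: the field is seconds since 1970-01-01 00:00:00 UTC.
print(epoch_to_utc(1164066728))  # prints 2006-11-20 23:52:08
```

Under that assumption the value matches the "Nov 20 23:52:08" syslog stamp to
the second, which would suggest the node's log times are in UTC.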

jrmedia-b

Nov 20 23:52:09 jrmedia-b kernel: GFS: fsid=alpha_cluster:customers.1: jid=0: Trying to acquire journal lock...
Nov 20 23:52:09 jrmedia-b kernel: GFS: fsid=alpha_cluster:customers.1: jid=0: Busy
Nov 20 23:53:58 jrmedia-b kernel: dlm: customers: process_lockqueue_reply id 7a601e3 state 0



jrmedia-c

Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Trying to acquire journal lock...
Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Looking at journal...
Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Acquiring the transaction lock...
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Replaying journal...
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Replayed 0 of 8 blocks
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: replays = 0, skips = 3, sames = 5
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Journal replayed in 1s
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Done
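
To make the order of events across the three nodes easier to follow, here is
a rough Python sketch that merges a few of the log entries into one timeline.
The event descriptions are hand-written summaries of the messages, not log
text, and the ordering is only meaningful on the assumption that the three
nodes' clocks were in sync. The names EVENTS and merged_timeline are
illustrative.

```python
from datetime import datetime

# (node, syslog time on Nov 20, hand-written summary of the log message)
EVENTS = [
    ("jrmedia-a", "23:52:08", "GFS consistency error in gfs_setbit, withdraw starts"),
    ("jrmedia-a", "23:52:11", "GFS reports customers.0 withdrawn"),
    ("jrmedia-a", "23:52:47", "clurgmgrd status check fails, service stop begins"),
    ("jrmedia-a", "23:52:59", "service customers fails to stop, marked failed"),
    ("jrmedia-b", "23:52:09", "tries journal lock for jid=0, finds it busy"),
    ("jrmedia-b", "23:53:58", "dlm process_lockqueue_reply message"),
    ("jrmedia-c", "23:52:09", "acquires journal lock for jid=0"),
    ("jrmedia-c", "23:52:10", "journal jid=0 replayed, done"),
]


def merged_timeline(events):
    """Return the events sorted by time of day (stable for equal times)."""
    return sorted(events, key=lambda e: datetime.strptime(e[1], "%H:%M:%S"))


for node, when, what in merged_timeline(EVENTS):
    print(f"{when}  {node}: {what}")
```

Read this way, the other two nodes react within a second of the withdraw on
jrmedia-a, and the resource manager's failed status check follows about 40
seconds later.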



Ben





