[Linux-cluster] GFS problem

Claudio Tassini claudio.tassini at gmail.com
Fri Nov 24 11:15:44 UTC 2006


Hi all,

I have a two-node cluster. Every time I shut down one of the cluster nodes,
the console of the other node prints these errors:
SCSI error : <1 0 1 1> return code = 0x20000
end_request: I/O error, dev sde, sector 33433224
device-mapper: dm-multipath: Failing path 8:64.
SCSI error : <1 0 1 1> return code = 0x20000
end_request: I/O error, dev sde, sector 33886548
SCSI error : <1 0 1 1> return code = 0x20000
...........
device-mapper: dm-multipath: Failing path 8:112.
SCSI error : <2 0 1 1> return code = 0x20000
end_request: I/O error, dev sdh, sector 342386776
Buffer I/O error on device diapered_dm-3, logical block 85596598
end_request: I/O error, dev sdh, sector 342386780
Buffer I/O error on device diapered_dm-3, logical block 85596599
Buffer I/O error on device diapered_dm-3, logical block 85596600
Buffer I/O error on device diapered_dm-3, logical block 85596601
..........
GFS: fsid=notartel:not-net.0: fatal: I/O error
GFS: fsid=notartel:not-net.0:   block = 1712284
GFS: fsid=notartel:not-net.0:   function = gfs_dreread
GFS: fsid=notartel:not-net.0:   file = /usr/src/build/765946-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/dio.c, line = 576
GFS: fsid=notartel:not-net.0:   time = 1164365987
GFS: fsid=notartel:not-net.0: about to withdraw from the cluster
GFS: fsid=notartel:not-net.0: waiting for outstanding I/O
..........
Buffer I/O error on device diapered_dm-3, logical block 85596582
GFS: fsid=notartel:not-net.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=notartel:not-net.0: withdrawn
GFS: Trying to join cluster "lock_dlm", "notartel:not-it"
GFS: fsid=notartel:not-it.0: Joined cluster. Now mounting FS...
GFS: fsid=notartel:not-it.0: jid=0: Trying to acquire journal lock...
GFS: fsid=notartel:not-it.0: jid=0: Looking at journal...
GFS: fsid=notartel:not-it.0: jid=0: Done
GFS: fsid=notartel:not-it.0: jid=1: Trying to acquire journal lock...
GFS: fsid=notartel:not-it.0: jid=1: Looking at journal...
GFS: fsid=notartel:not-it.0: jid=1: Done
GFS: fsid=notartel:not-it.0: jid=2: Trying to acquire journal lock...
GFS: fsid=notartel:not-it.0: jid=2: Looking at journal...
GFS: fsid=notartel:not-it.0: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "notartel:not-net"
GFS: fsid=notartel:not-net.0: Joined cluster. Now mounting FS...
GFS: fsid=notartel:not-net.0: jid=0: Trying to acquire journal lock...
GFS: fsid=notartel:not-net.0: jid=0: Looking at journal...
GFS: fsid=notartel:not-net.0: jid=0: Acquiring the transaction lock...
GFS: fsid=notartel:not-net.0: jid=0: Replaying journal...
GFS: fsid=notartel:not-net.0: jid=0: Replayed 533 of 1995 blocks
GFS: fsid=notartel:not-net.0: jid=0: replays = 533, skips = 293, sames = 1169
GFS: fsid=notartel:not-net.0: jid=0: Journal replayed in 1s
GFS: fsid=notartel:not-net.0: jid=0: Done
GFS: fsid=notartel:not-net.0: jid=1: Trying to acquire journal lock...
GFS: fsid=notartel:not-net.0: jid=1: Looking at journal...
GFS: fsid=notartel:not-net.0: jid=1: Done
GFS: fsid=notartel:not-net.0: jid=2: Trying to acquire journal lock...
GFS: fsid=notartel:not-net.0: jid=2: Looking at journal...
GFS: fsid=notartel:not-net.0: jid=2: Done
GFS: fsid=notartel:not-net.0: Scanning for log elements...
GFS: fsid=notartel:not-net.0: Found 6 unlinked inodes
GFS: fsid=notartel:not-net.0: Found quota changes for 2 IDs
GFS: fsid=notartel:not-net.0: Done


and the services on that node restart.
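
For reference, the path numbers in the dm-multipath lines ("Failing path 8:64.") can be matched to the sd devices named in the end_request lines. A minimal sketch in Python, assuming the classic Linux numbering in which block major 8 covers sda through sdp with 16 minors per disk; the function name sd_name is only illustrative:

```python
import string

SD_MAJOR = 8          # first SCSI disk major (covers sda..sdp)
MINORS_PER_DISK = 16  # minor 0 of each group is the whole disk, 1..15 are partitions


def sd_name(major: int, minor: int) -> str:
    """Map a major:minor pair from a dm-multipath message to an sd device name."""
    if major != SD_MAJOR:
        raise ValueError("this sketch only handles block major 8 (sda..sdp)")
    if not 0 <= minor < 256:
        raise ValueError("minor must be in the range 0..255")
    disk_letter = string.ascii_lowercase[minor // MINORS_PER_DISK]
    partition = minor % MINORS_PER_DISK
    return "sd" + disk_letter + (str(partition) if partition else "")


# The two failing paths reported in the log above.
for path in ("8:64", "8:112"):
    major, minor = (int(part) for part in path.split(":"))
    print(path, "->", sd_name(major, minor))
```

For the two paths in the log this gives sde and sdh, which matches the devices in the end_request lines.
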
The topology is as follows:
2 SunFire X4200 servers, each equipped with 2 QLogic (Sun) HBAs, which lspci
shows as:
Fibre Channel: QLogic Corp. QLA6322 Fibre Channel Adapter (rev 03)

connected via two SANBOX2 FC switches (also from QLogic) to a Sun StorEdge
3510 RAID array.

The cluster configuration consists of a mail service that mounts three GFS
filesystems, then starts postfix and courier-imap.

The problem seems to occur when the QLogic driver (qla6312) is loaded or
unloaded. I managed to reproduce it by running modprobe -r qla6312 followed
by modprobe qla6312: the other node immediately starts logging SCSI errors
until the GFS filesystems hang and are withdrawn.
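
The "return code = 0x20000" value in the SCSI error lines can be split into its component bytes. The following is a rough sketch, assuming the usual Linux SCSI mid-layer layout of the 32-bit result (status in bits 0-7, message in bits 8-15, host in bits 16-23, driver in bits 24-31) and the host-byte names from the 2.6-era include/scsi/scsi.h; the helper name decode_scsi_result is made up for illustration:

```python
# Host-byte codes as defined in the 2.6-era include/scsi/scsi.h
# (only the first few are listed in this sketch).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(result: int) -> dict:
    """Split a 32-bit SCSI result word into its four one-byte fields."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


fields = decode_scsi_result(0x20000)
print(fields)
print("host byte:", HOST_BYTE_NAMES.get(fields["host"], "unknown"))
```

Under those assumptions, 0x20000 has only the host byte set, with value 0x02 (DID_BUS_BUSY), so the error would be reported by the SCSI/HBA layer that sits below GFS.
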

Any idea whether this is a GFS fault or only a driver issue? If the latter,
which mailing list should I post to?

Thanks in advance

-- 
Claudio Tassini