[Linux-cluster] Periodic hang of file system accesses using GFS/GNBD (gnbd (pid 12082: du) got signal 1)

Ross Mellgren rmm-linux-cluster at z.odi.ac
Mon May 15 20:13:37 UTC 2006


Hi, I have a two-node cluster where each node exports its storage to the
other node, e.g.

nodeA:
  2tb array /dev/sdc
  LVM PV/VG/LV created
  /dev/nodea_sdc_vg/lvol0 mounted on /array/nodea
  /dev/sdc is exported via gnbd
  nodeb gnbd (/dev/sdc) device is imported
  /dev/nodeb_sdc_vg/lvol0 mounted on /array/nodeb

nodeB:
  2tb array /dev/sdc
  LVM PV/VG/LV created
  /dev/nodeb_sdc_vg/lvol0 mounted on /array/nodeb
  /dev/sdc is exported via gnbd
  nodea gnbd (/dev/sdc) device is imported
  /dev/nodea_sdc_vg/lvol0 mounted on /array/nodea
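For clarity, the cross-export layout above corresponds to roughly the following commands (a sketch only -- the export name "nodeb_sdc" is illustrative, and flags are from the gnbd man pages as I understand them):

```shell
# On nodeB: export the raw device over GNBD (export name is illustrative)
gnbd_export -d /dev/sdc -e nodeb_sdc

# On nodeA: import all GNBD exports from nodeb, then activate and mount
gnbd_import -i nodeb
vgchange -ay nodeb_sdc_vg
mount -t gfs /dev/nodeb_sdc_vg/lvol0 /array/nodeb
```

The mirror-image commands run on the other node, so each node mounts both its local array and the remote one.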

Everything seemed to work fine when I set it up. I ran bonnie++ with
fairly vigorous settings on each node against its local GFS, on each
node against the remote GFS, and both simultaneously, and everything
worked fine.

I've now put 200+ GB of data on it and I'm encountering a problem where
normal processes like find, du, or ls hang against nodeb's array when
run on nodea. Messages like the following appear in dmesg on nodea (note
that I have not used kill on any of these processes, so I'm not kill
-9'ing them to get this):

gnbd (pid 12082: du) got signal 9
gnbd0: Send control failed (result -4)
gnbd0: Receive control failed (result -32)
gnbd0: shutting down socket
exitting GNBD_DO_IT ioctl
resending requests
gnbd (pid 12082: du) got signal 1
gnbd0: Send control failed (result -4)
gnbd (pid 20598: find) got signal 9
gnbd0: Send control failed (result -4)
gnbd (pid 4238: diff) got signal 9
gnbd0: Send control failed (result -4)
gnbd0: Receive control failed (result -32)
gnbd0: shutting down socket
exitting GNBD_DO_IT ioctl
resending requests

Looking at the code with my limited knowledge of kernel programming,
this seems to mean a signal (9 = SIGKILL, 1 = SIGHUP) got trapped during
the sock_sendmsg/sock_recvmsg? It's pretty easy to make the problem
manifest.

I can clear the hang by running gnbd_export -O -R on the server (nodeb)
and re-exporting. The client (nodea) automatically picks up the
disconnect/reconnect and SIGKILLs the hung process.
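The recovery steps look roughly like this (a sketch; the export name is illustrative, and -O/-R are the override/remove-all-exports flags as I understand them):

```shell
# On the server (nodeb): forcibly remove all GNBD exports, ignoring
# open connections (-O overrides, -R removes all exports)
gnbd_export -O -R

# Then re-export the device; the client notices the reconnect and
# kills the hung process
gnbd_export -d /dev/sdc -e nodeb_sdc
```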

After this has happened a number of times, the GFS appears to have
become slightly corrupted -- running gfs_fsck -y -v on it cleaned up a
number of fsck bitmap mismatches.

It doesn't look like network connectivity is being lost at all between
the two nodes, but I can't be absolutely sure a single packet didn't get
dropped here or there.

Any help would be greatly appreciated!

-Ross


Vital statistics of the systems (both are running identical kernel +
GFS/GNBD/CMAN/etc modules, compiled on one and copied to the other)

Linux nodea 2.6.12.6 #2 SMP Fri Apr 14 19:59:14 EDT 2006 i686 i686 i386
GNU/Linux

cman-kernel-2.6.11.5-20050601.152643.FC4
dlm-kernel-2.6.11.5-20050601.152643.FC4
gfs-kernel-2.6.11.8-20050601.152643.FC4
gnbd-kernel-2.6.11.2-20050420.133124.FC4

Both boxes are dual 2.8 GHz Xeons with 4 GB RAM each (but with the BIOS
memory-mapping issue that prevents us from seeing all 4 GB, so really
3.3 GB). The arrays are SATA arrays on Areca cards -- one box has dual
ARC-1120s and the other a single ARC-1160, split up using LVM.




