[Linux-cluster] gnbd problem

Balagopal Pillai pillai at mathstat.dal.ca
Thu Jun 7 14:07:23 UTC 2007


Hi,

               I am running CentOS 4.4 with the Cluster Suite in a GNBD
+ GFS setup. The dual onboard NICs on the client nodes are bonded in
balance-alb mode, and the GNBD server has four NICs bonded the same way.
The GNBD server has a 16-channel 3ware controller running RAID 6. The
aggregate network throughput is great, and so is the performance of GFS.
This GFS installation is replacing my current Lustre installation.
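
               For reference, the bonding is set up the usual RHEL4 way;
the interface names and addresses below are just examples from one client:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.10
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes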

               Here is the problem: under heavy load, such as copying
lots of very big files or running dd in a loop from many HPC nodes
simultaneously, I get the following error messages:

gnbd (pid 5296: cp) got signal 9
gnbd0: Send control failed (result -4)
gnbd0: Send data failed (result -104)
gnbd0: Receive control failed (result -32)
gnbd0: shutting down socket
exitting GNBD_DO_IT ioctl
resending requests
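
               For context, the device is exported on the server and
imported on the clients with roughly the following (the device path,
export name and hostname here are just placeholders):

# on the GNBD server
gnbd_export -e gfsdev -d /dev/sda1

# on each client
gnbd_import -i gnbdserver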
 
                 One time with iozone, the GFS mount froze on all nodes,
but once I disabled the oopses_ok mount option that problem seems to
have gone away. Hopefully, if there is an oops, that node will panic
and won't freeze GFS for the rest of the nodes (see the sysctl sketch
below). If I update GNBD from 1.0.8 to 1.0.9, will that fix the GNBD
error messages under heavy load? The clients are using GNBD fencing. I
am a bit concerned that the GNBD client, when re-opening its connection
to the GNBD server, could corrupt data or freeze the GFS mount. Is that
a possibility? Since the other parts of the Cluster Suite are working
fine and it is only the GNBD client that is having problems, GNBD
fencing probably won't kick in.
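
               By "panic rather than hang" I mean something like the
following on the clients (standard sysctls; the values are just what I
would try, not something I have verified):

# /etc/sysctl.conf
kernel.panic_on_oops = 1     # turn any oops into a panic
kernel.panic = 30            # reboot 30 seconds after a panic

# apply without a reboot
sysctl -p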

                I have another related question. During a power outage,
the HPC nodes shut down after 5 minutes, and only the master node and
the storage server keep running on the UPS with its battery pack for
another hour or so. The master node re-exports the GFS mount via NFS to
our infrastructure servers. In this scenario, when all the HPC nodes
are down, the cluster loses quorum. Will the GFS mount on the master
node freeze when the cluster loses quorum? If it does, is there a way
around it, for example giving the master node a large share of the
votes (see the cluster.conf sketch below)? In Lustre this scenario
works: I can have a single server with mounted Lustre volumes still up
while all the other nodes are down due to a power outage. Thanks very
much.
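
               Something along these lines is what I have in mind (node
names and counts are made up to keep the arithmetic simple; the real
cluster is larger):

<cman expected_votes="12"/>
<clusternodes>
  <clusternode name="master" votes="7"/>
  <clusternode name="storage" votes="1"/>
  <clusternode name="hpc01" votes="1"/>
  <clusternode name="hpc02" votes="1"/>
  <clusternode name="hpc03" votes="1"/>
  <clusternode name="hpc04" votes="1"/>
</clusternodes>

With expected_votes at 12, quorum would be 12/2 + 1 = 7, so the master
node alone should still be quorate, if my understanding of CMAN's quorum
calculation is right.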

Balagopal Pillai



