[Linux-cluster] gfs_grow

Wed Oct 8 20:56:35 UTC 2008

Hi all,

I'd like to thank Bob Peterson for helping me solve the last problem I  
was seeing with my storage cluster.  I've got a new one now.  A couple of
days ago, site ops plugged in a new storage shelf and this triggered
some sort of error in the storage chassis.  I was able to sort that  
out with gfs_fsck, and have since gotten the new storage recognized by  
the cluster.  I'd like to make use of this new storage, and it's here  
that we run into trouble.

lvextend completed with no trouble, so I ran gfs_grow.  gfs_grow has  
been running for over an hour now and has not progressed past:

[root@s12n01 ~]# gfs_grow /dev/s12/scratch13
FS: Mount Point: /scratch13
FS: Device: /dev/s12/scratch13
FS: Options: rw,noatime,nodiratime
FS: Size: 4392290302
DEV: Size: 5466032128
Preparing to write new FS information...
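
As a rough sanity check on the two sizes above, here is a minimal Python sketch of how much space the grow is expected to add. The 4096-byte block size is an assumption (the output does not state the units), so treat the TiB figure as approximate; the helper name growth() is made up for illustration.

```python
# Sizes exactly as printed by gfs_grow in the output above.
FS_SIZE = 4392290302
DEV_SIZE = 5466032128

# ASSUMPTION: both sizes are counted in file system blocks of 4096 bytes.
# The gfs_grow output does not state its units; adjust if this is wrong.
BLOCK_SIZE = 4096


def growth(fs_size, dev_size, block_size=BLOCK_SIZE):
    """Return (added blocks, added TiB, percent growth) for a grow operation."""
    added_blocks = dev_size - fs_size
    added_tib = added_blocks * block_size / 2**40
    pct = 100.0 * added_blocks / fs_size
    return added_blocks, added_tib, pct


added_blocks, added_tib, pct = growth(FS_SIZE, DEV_SIZE)
print(f"added: {added_blocks} blocks, ~{added_tib:.2f} TiB, +{pct:.1f}%")
```

Under that block-size assumption the difference works out to roughly one new 4 TiB shelf.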

The load average on this node has risen from its normal ~30-40 to 513  
(the number of nfsd threads, plus one), and the file system has become  
slow-to-inaccessible on client nodes.  I am seeing messages like these
in my log files:

Oct  8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:26:00 s12n01 last message repeated 4 times
Oct  8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
Oct  8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
Oct  8 16:27:56 s12n01 last message repeated 2 times
Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct  8 16:28:34 s12n01 last message repeated 2 times
Oct  8 16:30:29 s12n01 last message repeated 2 times
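
For anyone reading along, the two error numbers in those messages are ordinary Linux errno values: 104 is ECONNRESET and 107 is ENOTCONN. Below is a small Python sketch that tallies them from a syslog excerpt. The sample text is the first four log entries above with the wrapped lines rejoined; the function name and the handling of "last message repeated" lines (credited to the most recent error seen) are my own simplifications, not something syslog guarantees.

```python
import re
from collections import Counter

# Linux errno names for the codes in the log (values are Linux-specific).
LINUX_ERRNO = {
    104: "ECONNRESET (connection reset by peer)",
    107: "ENOTCONN (transport endpoint is not connected)",
}

# First four log entries from the excerpt, one entry per line.
SAMPLE = """\
Oct  8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct  8 16:26:00 s12n01 last message repeated 4 times
Oct  8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
Oct  8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
"""

ERR_RE = re.compile(r"(?:got error -|\(err )(\d+)")
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_errors(text):
    """Count nfsd error codes, crediting repeat lines to the previous error."""
    counts = Counter()
    last = None
    for line in text.splitlines():
        m = ERR_RE.search(line)
        if m:
            last = int(m.group(1))
            counts[last] += 1
            continue
        r = REPEAT_RE.search(line)
        if r and last is not None:
            # Simplification: assume the repeat refers to the last error seen.
            counts[last] += int(r.group(1))
    return counts


counts = count_errors(SAMPLE)
for code, n in sorted(counts.items()):
    print(code, LINUX_ERRNO.get(code, "unknown"), n)
```

Both codes point at NFS clients dropping or losing their TCP connections, which fits the server being too busy to answer them in time.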

I was seeing similar messages this morning, but those went away when I  
mounted this file system on another node in the cluster, turned on  
statfs_fast, and then moved the service to that node.  I'm not sure  
what to do about it given that gfs_grow is running.  Is this something  
anyone else has seen?  Does anyone know what to do about this?  Do I  
have any option other than to wait until gfs_grow is done?  Given my  
recent experiences (see "lm_dlm_cancel" in the list archives), I'm  
very hesitant to hit ^C on this gfs_grow.  I'm running CentOS 4 for  
x86-64, kernel 2.6.9-67.0.20.ELsmp.

Thanks,

James