[Linux-cluster] gfs_grow

James Chamberlain jamesc at exa.com
Thu Oct 9 15:18:11 UTC 2008


Thanks Andrew.

What I'm really hoping for is anything I can do to make this gfs_grow  
go faster.  It's been running for 19 hours now; I have no idea when  
it'll complete, and the file system I'm trying to grow has been all  
but unusable for the duration.  This is a very busy file system, and I  
know it's best to run gfs_grow on a quiet file system, but there isn't  
too much I can do about that.  Alternatively, if anyone knows of a  
signal I could send to gfs_grow that would cause it to give a status  
report or increase verbosity, that would be helpful, too.  I have  
tried both increasing and decreasing the number of NFS threads, but  
since I can't tell where I am in the process or how quickly it's  
going, I have no idea what effect this has on operations.
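
For what it's worth, the only visibility I have into the NFS side right  
now is the "th" line in /proc/net/rpc/nfsd.  As I understand it, the  
second number there is how many times all nfsd threads have been busy  
at once, and the ten numbers after it are buckets showing how long  
usage sat in each 10% band of the thread pool.  Roughly what I've been  
doing to watch that and to change the thread count on the fly is below  
-- 640 is just an example figure, and I haven't confirmed that poking  
rpc.nfsd in the middle of a gfs_grow is entirely safe:

grep ^th /proc/net/rpc/nfsd            # 2nd field = times all nfsd threads were busy at once
rpc.nfsd 640                           # change the number of running nfsd threads on the fly
grep RPCNFSDCOUNT /etc/sysconfig/nfs   # where the boot-time thread count is usually set on CentOS

If the all-threads-busy counter keeps climbing after a bump, I take it  
the threads are still saturated and the real bottleneck is the gfs  
latency rather than the thread count.  (For scale, if I'm reading the  
gfs_grow output below correctly and we're on the default 4K block size,  
this grow takes the file system from roughly 16 TiB to roughly 20 TiB.)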

Thanks,

James

On Oct 8, 2008, at 5:12 PM, Andrew A. Neuschwander wrote:

> James,
>
> I have a CentOS 5.2 cluster where I would see the same nfs errors  
> under certain conditions. If I did anything that introduced latency  
> to my gfs operations on the node that served nfs, the nfs threads  
> couldn't service requests as fast as they came in from clients.  
> Eventually my nfs threads would all be busy and start dropping nfs  
> requests. I kept an eye on my nfsd thread utilization (/proc/net/rpc/ 
> nfsd) and kept bumping up the number of threads until they could  
> handle all the requests while the gfs had a higher latency.
>
> In my case, I had EMC Networker streaming data from my gfs  
> filesystems to a local scsi tape device on the same node that served  
> nfs. I eventually separated them onto different nodes.
>
> I'm sure gfs_grow would slow down your gfs enough that your nfs  
> threads couldn't keep up. NFS on gfs seems to be very latency  
> sensitive. I have a quick and dirty perl script to generate a  
> histogram image from nfs thread stats if you are interested.
>
> -Andrew
> --
> Andrew A. Neuschwander, RHCE
> Linux Systems/Software Engineer
> College of Forestry and Conservation
> The University of Montana
> http://www.ntsg.umt.edu
> andrew at ntsg.umt.edu - 406.243.6310
>
>
> James Chamberlain wrote:
>> Hi all,
>> I'd like to thank Bob Peterson for helping me solve the last  
>> problem I was seeing with my storage cluster.  I've got a new one  
>> now.  A couple days ago, site ops plugged in a new storage shelf  
>> and this triggered some sort of error in the storage chassis.  I  
>> was able to sort that out with gfs_fsck, and have since gotten the  
>> new storage recognized by the cluster.  I'd like to make use of  
>> this new storage, and it's here that we run into trouble.
>> lvextend completed with no trouble, so I ran gfs_grow.  gfs_grow  
>> has been running for over an hour now and has not progressed past:
>> [root at s12n01 ~]# gfs_grow /dev/s12/scratch13
>> FS: Mount Point: /scratch13
>> FS: Device: /dev/s12/scratch13
>> FS: Options: rw,noatime,nodiratime
>> FS: Size: 4392290302
>> DEV: Size: 5466032128
>> Preparing to write new FS information...
>> The load average on this node has risen from its normal ~30-40 to  
>> 513 (the number of nfsd threads, plus one), and the file system has  
>> become slow-to-inaccessible on client nodes.  I am seeing messages  
>> in my log files that indicate things like:
>> Oct  8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104  
>> when sending 140 bytes - shutting down socket
>> Oct  8 16:26:00 s12n01 last message repeated 4 times
>> Oct  8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
>> Oct  8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
>> Oct  8 16:27:56 s12n01 last message repeated 2 times
>> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104  
>> when sending 140 bytes - shutting down socket
>> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104  
>> when sending 140 bytes - shutting down socket
>> Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
>> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104  
>> when sending 140 bytes - shutting down socket
>> Oct  8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104  
>> when sending 140 bytes - shutting down socket
>> Oct  8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
>> Oct  8 16:28:34 s12n01 last message repeated 2 times
>> Oct  8 16:30:29 s12n01 last message repeated 2 times
>> I was seeing similar messages this morning, but those went away  
>> when I mounted this file system on another node in the cluster,  
>> turned on statfs_fast, and then moved the service to that node.   
>> I'm not sure what to do about it given that gfs_grow is running.   
>> Is this something anyone else has seen?  Does anyone know what to  
>> do about this?  Do I have any option other than to wait until  
>> gfs_grow is done?  Given my recent experiences (see "lm_dlm_cancel"  
>> in the list archives), I'm very hesitant to hit ^C on this  
>> gfs_grow.  I'm running CentOS 4 for x86-64, kernel  
>> 2.6.9-67.0.20.ELsmp.
>> Thanks,
>> James
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
