[Linux-cluster] NFS on GFS architectural issues / problems

Riaan van Niekerk riaan at obsidian.co.za
Mon Aug 21 14:41:12 UTC 2006


Wendy Cheng wrote:
> Riaan van Niekerk wrote:
> 
>>
>> My question to you or anyone who is familiar with NFS on GFS, or GFS 
>> in general, which of the following are still valid issues for the 
>> current (6.1u4) version of GFS. If all or most of them still apply, I 
>> can use this as motivation for my customer to strongly consider going 
>> off NFS on GFS. Removing the NFS from our GFS cluster has been on the 
>> cards for quite a while, but has not gained momentum due to lack of 
>> information on the performance gains of such a move (very difficult to 
>> gauge) or the architectural problems/limitations of NFS on GFS (for 
>> which the following extract is spot-on).
> 
> 
> These have been worked on and some of them do have test patches ready to 
> address the issues. However, the changes are non-trivial and may involve 
> base kernel modification that we need to get upstream (community linux 
> kernel) acceptance. The efforts take time since we would like to do it 
> conservatively to preserve GFS1/2 stability. Unless the posted problems 
> have urgent needs (let us know), the current NFS-GFS development focus 
> is on failover (Red Hat bugzilla 132823).
> 
> Is performance the primary concern you have now ?
> 
> -- Wendy

Yes, mostly. We have a couple of open service requests for stability. 
They are very intermittent and not reproducible (and nothing in 
bugzilla seems to match):

a) The load average on the nodes steadily climbs until it reaches the 
nfsd count, at which point all I/O hangs. We reboot the nodes one by 
one, and as soon as the one with the stuck lock is bounced, I/O returns 
on all nodes.

b) Kernel oopses with "Assertion failed" on line 428 / 357 of dlm/lock.c 
while there is no load on the system. This happens 3 days in a row, 
over a weekend, and then the error does not recur for weeks.

Getting the info that support requires (sysrq-t, lockdump, etc. on all 
nodes, crashdump on the failing node) is pretty difficult. We are not 
married to NFS on GFS, even though it is a cost-effective interim step 
until we can get all our mail servers (14 in all) SAN-attached.
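
For symptom (a), a small watcher on each node could at least flag the climb before I/O hangs, so the diagnostics can be captured in time. The following is only a rough sketch, not something we run: it assumes the nfsd filesystem is mounted at /proc/fs/nfsd (so the thread count can be read from its "threads" file), and the 0.8 warning ratio, the 60-second interval and all function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Warn when the 1-minute load average approaches the nfsd thread count."""
import sys
import time

LOADAVG_PATH = "/proc/loadavg"
# Assumption: the nfsd filesystem is mounted at /proc/fs/nfsd on this node.
NFSD_THREADS_PATH = "/proc/fs/nfsd/threads"


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_nfsd_threads(text):
    """Return the configured nfsd thread count from the 'threads' file."""
    return int(text.split()[0])


def near_saturation(load, threads, ratio=0.8):
    """True when load has reached `ratio` of the nfsd thread count.

    The ratio is an arbitrary illustrative threshold, not a tuned value.
    """
    return threads > 0 and load >= ratio * threads


def read_file(path):
    with open(path) as handle:
        return handle.read()


def watch(interval=60):
    """Poll forever and print a warning to stderr when load nears the nfsd count."""
    while True:
        load = parse_loadavg(read_file(LOADAVG_PATH))
        threads = parse_nfsd_threads(read_file(NFSD_THREADS_PATH))
        if near_saturation(load, threads):
            print("WARNING: load %.2f is approaching nfsd count %d"
                  % (load, threads), file=sys.stderr)
        time.sleep(interval)


if __name__ == "__main__":
    watch()
```

The warning could be the trigger for collecting sysrq-t and lockdump output while the node is still responsive, instead of after the hang.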

Can I read into "have been worked on" and "some do have test patches" 
that these 4 issues still persist? I need the ammunition to motivate the 
move away from NFS on GFS. This architecture document gives it to me if 
these issues are still valid.

tnx
Riaan
