[Linux-cluster] gfs_grow experience

Wed Aug 23 11:34:36 UTC 2006

Yesterday evening we grew our 6-node 2.4 TB GFS 6.1 filesystem to 4.5 TB.

Here is our experience, which I hope others can benefit from.

Having grown the underlying LUN (on an EMC CX500) a couple of weeks ago,
we were bitten by this parted bug: "Parted segfaults because of extended
devices with GPT partition table"
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=194238
(for which we just got a hotfix from RH, after extensive testing of test
packages).

The partitioning and LVM steps all went smoothly. gfs_grow -T (test)
showed nothing funny. When we started the real gfs_grow, things started
out smoothly. After about 7 to 10 minutes, the GFS was withdrawn on 5 of
the nodes (the only node that did not withdraw was the one on which
gfs_grow was running):

Aug 23 00:58:10 host kernel: GFS: fsid=webmail:gfs_mail.4: jid=5: Trying to acquire journal lock...
Aug 23 00:58:10 host kernel: GFS: fsid=webmail:gfs_mail.4: jid=5: Busy
Aug 23 00:58:29 host kernel: attempt to access beyond end of device
...
Aug 23 00:58:29 host kernel: attempt to access beyond end of device
Aug 23 00:58:29 host kernel: dm-0: rw=0, want=7803155736, limit=5044961280
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: fatal: I/O error
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4:   block = 975394466
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4:   function = gfs_dreread
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4:   file = /usr/src/build/765788-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/dio.c, line = 576
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4:   time = 1156287509
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: about to withdraw from the cluster
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: waiting for outstanding I/O
Aug 23 00:58:29 host kernel: GFS: fsid=webmail:gfs_mail.4: telling LM to withdraw
Aug 23 00:58:32 host kernel: lock_dlm: withdraw abandoned memory
Aug 23 00:58:32 host kernel: GFS: fsid=webmail:gfs_mail.4: withdrawn
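The numbers in the log above can be cross-checked with a little
arithmetic. The sketch below assumes that want/limit are counted in
512-byte sectors and that the GFS block size is 4096 bytes (the block
size is not stated in this post; it is an assumption). Under those
assumptions, the end of block 975394466 lands exactly on the "want"
sector, and "limit" works out to roughly the pre-grow size of the
filesystem. One possible reading is that the withdrawing nodes still saw
the old device size, but the log alone does not prove that.

```python
# Cross-check of the "attempt to access beyond end of device" numbers.
# Assumptions (not stated in the original post):
#   - want/limit in the kernel message are 512-byte sectors
#   - the GFS filesystem block size is 4096 bytes
SECTOR_BYTES = 512
FS_BLOCK_BYTES = 4096  # assumed block size


def block_end_sector(block, block_bytes=FS_BLOCK_BYTES):
    """Sector number just past the end of the given filesystem block."""
    return (block + 1) * block_bytes // SECTOR_BYTES


def sectors_to_tb(sectors):
    """Convert a sector count to decimal terabytes (10**12 bytes)."""
    return sectors * SECTOR_BYTES / 1e12


want = block_end_sector(975394466)   # block number from the GFS error
limit = 5044961280                   # device size reported for dm-0

print("end sector of failing block:", want)
print("beyond device limit:", want > limit)
print("device limit in TB:", round(sectors_to_tb(limit), 2))
print("failing block offset in TB:", round(sectors_to_tb(want), 2))
```

The computed end sector equals the want=7803155736 value in the log,
which is consistent with the 4096-byte block size assumption.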

Definitely not a nice message to see during something as suspenseful as
a gfs_grow, which you cannot roll back, and for which interrupting and
resuming does not seem to be recommended. Even more so since the fs
needs to be mounted in order to be grown. While the GFS is being grown,
I/O to the fs is blocked. The only way I could tell that the gfs_grow
was still busy doing something was to run strace on its PID.
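As a lighter-weight alternative to strace, a process's I/O counters can
be sampled twice and compared. This is only a sketch: it relies on
/proc/<pid>/io, which needs a kernel with per-task I/O accounting and
may not be available on the 2.6.9 kernel seen in the logs. The function
names here are hypothetical helpers, not part of any GFS tool.

```python
# Sketch: decide whether a long-running process (e.g. gfs_grow) is still
# doing I/O by comparing two samples of /proc/<pid>/io.
# Assumption: the kernel exposes /proc/<pid>/io (per-task I/O accounting).


def parse_proc_io(text):
    """Parse 'name: value' lines from /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def made_progress(before, after, keys=("read_bytes", "write_bytes")):
    """True if any of the chosen counters increased between two samples."""
    return any(after.get(k, 0) > before.get(k, 0) for k in keys)


def read_proc_io(pid):
    """Read and parse the I/O counters of a live process."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())


# Example with two captured samples (values are made up for illustration):
sample_1 = parse_proc_io("read_bytes: 4096\nwrite_bytes: 1048576\n")
sample_2 = parse_proc_io("read_bytes: 4096\nwrite_bytes: 2097152\n")
print(made_progress(sample_1, sample_2))
```

On a live system one would call read_proc_io(pid) twice, a few seconds
apart, and pass both results to made_progress.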

After 16 very long minutes, the grow completed. The GFS on 2 of the
nodes could be brought back by a simple 'service gfs restart'. The
others had to be bounced. After 30 minutes of everything being up, those
2 nodes also lost the FS with the same error message as above and had to
be bounced.

When I disabled quotas (we were still in our maintenance window), I
mistakenly ran the command 'gfs_tool settune /mnt/san quota_account 0'
on more than one node, because the quota value was not updated quickly
enough on the other nodes after I ran it on the first node. The FS was
withdrawn again on 2 nodes, with this error:

Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: fatal: filesystem consistency error
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   function = trans_go_xmote_bh
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   file = /usr/src/build/765787-i686/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glops.c, line = 542
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0:   time = 1156290932
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: about to withdraw from the cluster
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: waiting for outstanding I/O
Aug 23 01:55:32 host kernel: GFS: fsid=webmail:gfs_mail.0: telling LM to withdraw
Aug 23 01:55:37 host kernel: lock_dlm: withdraw abandoned memory
Aug 23 01:55:37 host kernel: GFS: fsid=webmail:gfs_mail.0: withdrawn

After bouncing them, all seemed well. To Red Hat: would it make sense to
log bugzillas for these withdraw scenarios? They seem to be bugs in
gfs_grow and in gfs_tool settune/quota, unless the withdraw during
gfs_grow works as intended, and even though the latter is probably
PEBKAC / incorrect usage. I will not be able to replicate this easily,
and we are fine now (hopefully) despite these hiccups (i.e. I have no
reason to open Service Requests). I am sure others might run into these
as well.

greetings
Riaan