[Linux-cluster] gfs_grow

Tue Aug 28 14:30:32 UTC 2007

On Tue, 2007-08-28 at 10:08 +0100, Ben Yarwood wrote:
> I am using a 3 Node cluster using RHEL4U4.
> 
> I ran a gfs_grow yesterday on one of our filesystems but stupidly missed a process that was using the same file system.  The grow
> process hung and when I got it to exit, the file system is now reporting as having grown to the larger size but no extra space has
> appeared.  Basically my file system grew from 14TB to 15TB and my usage also grew from 13TB to 14TB.
> 
> Does anyone know if it's possible to get this space back?  I know I could probably do a gfs_fsck but given the size of the file
> system, this would take a few days according to some previous reports.
> 
> Thanks
> Ben

Hi Ben,

The fact that a process was using the file system shouldn't have been
a problem; gfs_grow should have been able to work around it.  It would
have been interesting to see where gfs_grow was "hung", but it's too
late for that now.  My guess is that you killed gfs_grow before it was
able to update the resource group index properly.

In RHEL4U4, gfs_fsck has a feature that repairs damaged resource groups
(RGs) and RG indexes.  The repair code has a harder job once the file
system has been extended, though, so although you probably don't want
to hear this, you should make a backup of your data first, just to be
safe.

Running gfs_fsck will take a while on a file system that big, but how
long depends on the speed of your hardware.  I'd expect it to take less
than a day to complete.  If you can't afford the downtime, it may help
to know that the RG repair is done before any of the passes, so in
theory you could let gfs_fsck repair the RGs and then kill it.  Newer
versions of gfs_fsck catch <ctrl-c> interrupts and give you options to
skip parts of the check, but I don't think that's in RHEL4U4 (I think
it went into RHEL4.5).

So I guess my recommendation is:

1. Make a backup of your data.
2. Wait until most people have gone home for the day.
3. Unmount the file system from ALL nodes.
4. Run gfs_fsck.
5. Watch the gfs_fsck output for messages about finding and fixing
   RG damage just so you know it did something.
6. Let gfs_fsck run overnight.
7. If you need the file system back and it's still running by morning,
   you could kill it manually.  It would be better to let it run, but
   it shouldn't do any harm to kill it prematurely if necessary.
8. Remount the file system and see if df shows the correct values.
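
For the df check in step 8, here is a rough sketch in Python of the
comparison I have in mind.  It is not part of any GFS tool.  The names
Usage, read_usage, free_space_gained and grow_was_credited are made up
for illustration, and the 5% tolerance is an arbitrary assumption to
allow for metadata overhead in the new space.  It also assumes nobody
wrote much data between the two measurements.

```python
import shutil
from collections import namedtuple

# Total and used bytes, the same pair of numbers df reports.
Usage = namedtuple("Usage", ["total", "used"])

TB = 1024 ** 4  # assumption: binary terabytes


def read_usage(mount_point):
    """Read total and used bytes for a mounted file system."""
    u = shutil.disk_usage(mount_point)
    return Usage(u.total, u.used)


def free_space_gained(before, after):
    """Bytes of free space that appeared between the two measurements."""
    return (after.total - after.used) - (before.total - before.used)


def grow_was_credited(before, after, tolerance=0.05):
    """True if nearly all of the added capacity showed up as free space.

    tolerance is a hypothetical allowance for metadata overhead.
    """
    added = after.total - before.total
    return added > 0 and free_space_gained(before, after) >= added * (1 - tolerance)


# The numbers from the original report: 14TB -> 15TB total, 13TB -> 14TB used.
before = Usage(total=14 * TB, used=13 * TB)
after = Usage(total=15 * TB, used=14 * TB)
print(free_space_gained(before, after))   # 0: the new terabyte is counted as used
print(grow_was_credited(before, after))   # False
```

After a successful repair you would expect read_usage() on the mount
point to return a total near 15TB with used back near 13TB, in which
case grow_was_credited() returns True.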

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite