[Linux-cluster] managing GFS corruption on large FS

Robert Peterson rpeterso at redhat.com
Wed Nov 29 16:04:47 UTC 2006


Riaan van Niekerk wrote:
> hi all
>
> We have a large GFS consisting of 4 TB of maildir data. There is
> corruption on this GFS that causes nodes to be withdrawn intermittently.
>
> The fs corruption is due to user error and a lack of documentation
> (initially not having the clustered flag enabled on the VG when growing
> the LV/GFS). We now know better, and will avoid this particular cause of
> corruption. However, management wants to know from us how we can prevent
> corruption, or minimize the downtime incurred if this should happen again.
>
> For the current problem, since a gfs_fsck will take too long (we 
> cannot afford the 1 - 3 days of downtime it will take to complete the 
> fsck), we are planning to migrate the data to a new GFS, and at the 
> same time set up the new environment optimally to cause the minimum of 
> downtime, if a corruption were to happen again.
>
> One option is to split the one big GFS into a number of smaller GFS's.
> Unfortunately, our environment does not lend itself to being split up
> into (for example) a number of 200GB GFS's. Also, this negates a lot of
> the advantages of GFS (e.g. having your storage consolidated onto one
> big GFS, and scaling it out by growing the GFS and adding nodes).
>
> I would really like to know how others on this list manage the 
> threat/risk of FS corruption, and the corruption itself, if it does 
> happen. Also, w.r.t. data protection, if you do snapshots, SAN-based
> mirroring, or backup to disk/tape, I would really appreciate it if you
> could give me detailed information like
> a) mechanism (e.g snaps, backup, etc)
> b) type of data (e.g. many small files)
> c) size of GFS
> d) the time it takes to perform the action
>
> thank you
> Riaan
Hi Riaan,

You've raised a good question, and I thought I'd address some of your
issues. I'm just throwing these out in no particular order.

Running gfs_fsck is understandably slow, but there are a few things to bear
in mind:

1. A 4TB file system is not excessive by any means.  As I stated in the
   cluster FAQ, a customer reported running gfs_fsck on a 45TB file system
   and it took only 48 hours, and even that was slower than it should have
   been because the system ran out of memory and started swapping to disk.
   Your 4TB file system should take a lot less time, since it's less than
   a tenth of that size.  That depends, of course, on your hardware as
   well.  See:

http://sources.redhat.com/cluster/faq.html#gfs_fsck1

2. I've recently figured out a couple of ways to improve the speed of
   gfs_fsck.  For example, for a recent bugzilla, I patched a memory leak
   and combined passes through the file system inside the duplicate-checking
   code, pass1b.  For a list of improvements, see this bugzilla, especially
   comment #33:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208836

   I think this should be available in RHEL4 U5.

3. gfs_fsck takes a lot of memory to run; when it runs out, it starts
   swapping to disk, which slows it down considerably.  So be sure to run
   it on a system with lots of memory.
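
   As a rough sketch of that advice (the device path is just an example,
   and of course the file system must be unmounted on every node first):

       # Check how much RAM and swap are free before kicking off the fsck
       free -m
       # Run the check; on a big file system expect heavy memory use
       gfs_fsck /dev/myvg/mygfs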

4. We're continuing to improve the gfs_fsck code all the time.
   Jon Brassow and I have done some brainstorming and hope to keep
   making it faster.  I've come up with some more memory-saving ideas
   that I have yet to try out.  Maybe soon.

5. Another thing that slows down gfs_fsck is running it in verbose mode.
   Sometimes verbose mode is useful, but it will slow you down
   considerably, so don't use -v or -vv unless you have to.  If you're
   only using -v to figure out where fsck is in the process, I have a
   couple of improvements: in the most recent version of gfs_fsck (for
   the bugzilla above) I've added more "% complete" messages.  Also, if
   you interrupt that version by hitting <ctrl-c>, it will tell you what
   block it's currently working on and allow you to continue.  Again, I
   think this should be in RHEL4 U5.
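
   In other words (the device path below is a placeholder), the difference
   is just which flags you pass:

       # Slow: full verbose output
       gfs_fsck -v /dev/myvg/mygfs
       # Faster: normal output; with the newer version you can hit
       # <ctrl-c> to see the current block and choose whether to continue
       gfs_fsck /dev/myvg/mygfs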

6. I recently discovered an issue that impacts GFS performance on large
   file systems, not only for gfs_fsck but for general use as well.  The
   issue has to do with the size of the GFS resource groups (RGs), an
   internal GFS structure for managing data (not to be confused with
   rgmanager's Resource Groups).  Some file system slowdown can be blamed
   on having a large number of RGs, and the bigger your file system, the
   more RGs you need.  By default, gfs_mkfs carves your file system into
   256MB RGs, but it allows you to specify a preferred RG size.  The
   default of 256MB is good for average-size file systems, but you can
   increase performance on a bigger file system by using a bigger RG
   size.  For example, my 40TB file system requires approximately 156438
   RGs of 256MB each, and whenever GFS has to walk that linked list, it
   takes a long time.  The same 40TB file system can be created with
   bigger RGs--2048MB--requiring only 19555 of them.  The time savings
   are dramatic: it took nearly 23 minutes for my system to read in all
   156438 RG structures (with 256MB RGs), but only 4 minutes to read in
   the 19555 RG structures for my 2048MB RGs.  The time to do an
   operation like df on an empty file system dropped from 24 seconds
   with 256MB RGs to under a second with 2048MB RGs.  I'm sure that
   increasing the size of the RGs would help gfs_fsck's performance as
   well, although I can't make any performance promises; I can only tell
   you what I observed in this one case.  The issue is documented in this
   bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213763

   I'm going to try to get a KnowledgeBase article written up about this,
   by the way, and I'll try to put something into the FAQ too.

   For RHEL5, I'm changing gfs_mkfs so that it picks a more intelligent
   RG size based on the file system size, letting users take advantage of
   this performance benefit without ever knowing or caring about the RG
   size.

   Unfortunately, there's no way to change the RG size once a file system
   has been made; it can only be set at gfs_mkfs time.
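
   To make that concrete, here's a sketch (the cluster name, file system
   name, journal count, and device path are all placeholders) using
   gfs_mkfs's -r option to pick the RG size at creation time:

       # Default 256MB RGs; a 40TB volume ends up with roughly 156000 RGs
       gfs_mkfs -p lock_dlm -t mycluster:mygfs -j 8 /dev/myvg/mygfs
       # 2048MB RGs; the same 40TB volume needs only about 19500 RGs
       gfs_mkfs -p lock_dlm -t mycluster:mygfs -j 8 -r 2048 /dev/myvg/mygfs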

7. As for file system corruption, that's a tough issue.  First of all,
   it's very rare.  In virtually all the cases I've seen, it was caused
   by influences outside of GFS itself, like the case you mentioned:
   (1) someone swapping a hard drive that resided in the middle of a GFS
   logical volume, (2) someone running gfs_fsck while the volume was
   still mounted by a node, or (3) someone messing with the SAN from a
   machine outside of GFS.  If there are other ways to cause GFS file
   system corruption, we need users to open bugzillas so we can work on
   the problem, and even so, it's nearly impossible to tell how
   corruption occurs unless it can be recreated here in our lab.
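
   On the particular cause you ran into, by the way, it's cheap to verify
   the clustered flag on the VG before any grow operation.  Something
   like this (the VG name is just an example):

       # In the vg_attr column, the sixth character should be 'c'
       # for a clustered volume group
       vgs -o vg_name,vg_attr
       # Set the clustered flag if it's missing, before growing the LV
       vgchange -cy myvg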

I'm going to continue to search for ways to improve the performance of
GFS and gfs_fsck because you're right: the needs of our users are
increasing, and people are using bigger and bigger file systems all the
time.

Regards,

Bob Peterson
Red Hat Cluster Suite



