[Linux-cluster] errors with GFS2 and DRBD. Please help..

Sat Mar 13 12:58:31 UTC 2010

Hi,
Thanks for replying.
The log is not from booting. I had initiated a resync with invalidate, because the other server had been stable for 19 hours at that point. (It also had a problem again this afternoon.) GFS2 is mounted only after both DRBD devices are made primary (both steps are run from rc.local, joined with an && condition). I also forgot to mention that ever since we started having these problems I have stopped the GNBD import on the third node.

I suspect file system corruption due to the UPS failure, when the servers' power went off abruptly. But fsck.gfs2 should have resolved that. The only way I see of solving it is to back up the file system with rsync to another disk, format and re-create the GFS2 partitions, then restore the data. The pitfall is the possibility of frequent restarts due to GFS2 withdrawals.

Please advise..

 With warm regards
Koustubha Kale

________________________________
From: Kaloyan Kovachev <kkovachev at varna.net>
To: linux clustering <linux-cluster at redhat.com>
Sent: Sat, 13 March, 2010 4:58:28 PM
Subject: Re: [Linux-cluster] errors with GFS2 and DRBD. Please help..

Hi,
I have a similar setup, but with iSCSI instead of GNBD. When the DRBD
devices are exported as fileio instead of blockio, there are similar problems
caused by read caching. Check whether GNBD is doing any caching and
disable it. I don't think it is related to the UPS, but see below.

On Sat, 13 Mar 2010 14:31:31 +0530 (IST), Koustubha Kale wrote 
> Hi all, 
> We have a three-node GFS2 cluster on CentOS 5.4. The output of uname -a is "Linux IMSTermServer4.vpmthane.org 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux".
> It is set up as a two-node primary-primary DRBD cluster, drbd version: 8.3.7 (api:88/proto:86-91). Two LVM volumes are created on top of DRBD, as shown below in the drbd-overview output.
> 
> 10:r0  Connected Primary/Primary UpToDate/UpToDate C r---- lvm-pv: Fac1 400.00G 400.00G
> 11:r1  Connected Primary/Primary UpToDate/UpToDate C r---- lvm-pv: Stu1 491.20G 491.20G
> 
> The interconnect between the nodes is a dedicated Gigabit switch.
>
> The third node imports the above two file systems through GNBD.
> 
> The setup was working fine for several months, until one day we had a UPS failure. Ever since then we have had very frequent GFS2 errors and file system withdrawals, with nodes restarting. The errors in the log are as shown below.
> 
>   
> Mar 13 11:12:40 IMSTermServer5 kernel: block drbd10: Resync done (total 4 sec; paused 0 sec; 12288 K/sec)
> Mar 13 11:12:40 IMSTermServer5 kernel: block drbd10: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> Mar 13 11:12:43 IMSTermServer5 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.0: fatal: filesystem consistency error
> Mar 13 11:12:43 IMSTermServer5 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.0:   RG = 26343101
> Mar 13 11:12:43 IMSTermServer5 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.0:   function = gfs2_setbit, file = fs/gfs2/rgrp.c, line = 97
> Mar 13 11:12:43 IMSTermServer5 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.0: about to withdraw this file system
> Mar 13 11:12:43 IMSTermServer5 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.0: telling LM to withdraw
> Mar 13 11:13:08 IMSTermServer5 kernel: block drbd11: Resync done (total 32 sec; paused 0 sec; 10112 K/sec)
> Mar 13 11:13:08 IMSTermServer5 kernel: block drbd11: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> 

This is from a DRBD resynchronization. If this log is from booting, it is
partially normal, but if it is from normal operation, then you have some network
problems: check the switch and the cables.
Also, if the log is from booting, it seems you have GFS2 mounted before the
resync has completed. You should wait for all DRBD resources to become
Primary/Primary before you start the cluster, or else corruption is guaranteed.
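
A minimal sketch of such a readiness check, in Python, in case it helps. It assumes the /proc/drbd status line format printed by DRBD 8.3 ("10: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----"); the function names drbd_not_ready and read_proc_drbd are made up for this example and are not part of DRBD. It also insists on Connected and UpToDate/UpToDate, which is stricter than Primary/Primary alone.

```python
import re

# Per-resource status line of /proc/drbd, assuming the DRBD 8.3 layout:
#   "10: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----"
# An unconfigured minor prints only "cs:Unconfigured", so ro:/ds: are optional.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"(?:\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+))?"
)

# The only state in which this sketch considers a resource safe to mount on.
READY = ("Connected", "Primary/Primary", "UpToDate/UpToDate")


def drbd_not_ready(proc_drbd_text):
    """Return [(minor, state), ...] for every resource that is not READY.

    An empty list means all resources found in the text are ready.
    If no resource line is found at all, that is reported as a problem too.
    """
    problems = []
    seen = 0
    for line in proc_drbd_text.splitlines():
        m = STATUS_RE.match(line)
        if not m:
            continue  # version banner, counters, sync progress lines
        seen += 1
        state = (m.group("cs"), m.group("ro"), m.group("ds"))
        if state != READY:
            problems.append(
                (int(m.group("minor")), " ".join(s or "-" for s in state))
            )
    if seen == 0:
        problems.append((-1, "no DRBD resources found"))
    return problems


def read_proc_drbd(path="/proc/drbd"):
    """Read the kernel's DRBD status text (the path is the usual default)."""
    with open(path) as f:
        return f.read()
```

A boot script could call drbd_not_ready(read_proc_drbd()) in a loop with a sleep, and only go on to start the cluster services and mount GFS2 once it returns an empty list.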

> OR 
> 
> Mar  8 13:23:02 IMSTermServer4 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.1: fatal: invalid metadata block
> Mar  8 13:23:02 IMSTermServer4 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.1:   bh = 26216898 (magic number)
> Mar  8 13:23:02 IMSTermServer4 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.1:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334
> Mar  8 13:23:02 IMSTermServer4 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.1: about to withdraw this file system
> Mar  8 13:23:02 IMSTermServer4 kernel: GFS2: fsid=NEW_BRIMS:Gfs2Stu1.1: telling LM to withdraw
> 
> The errors are much more frequent on the Stu volume. I don't want to lose any data.
> I have tried running fsck.gfs2 on both servers on both the Fac and Stu volumes (unmounted, of course) several times. I have tried updating all cluster and cluster-storage related RPMs, and I have updated to the recent stable drbd 8.3.7 from drbd-8.3.7rc1, which was installed earlier (and which worked fine for several months), but the problems persist.
> 
> Any ideas how I can resolve this issue?
>  With warm regards 
> Koustubha Kale 
> 
> 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
