Ext3 corruption using cluster

Sat May 9 08:58:28 UTC 2009

On May 09, 2009  10:25 +0200, Stefano Cislaghi wrote:
> Maybe... looking around some solutions can be:
> - maximize journal size
> - journaling all data and metadata (mount -o data=journal)

No, these have nothing to do with your problem.  If you are running
in a failover environment you need to STONITH the failing server
BEFORE the backup server tries to take over.  Otherwise both nodes
can end up with the same filesystem mounted at the same time.
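
As a rough sketch of that ordering on the standby node (fence_peer
and every argument passed to takeover are placeholders I made up for
illustration, not anything from your setup or from a particular
cluster manager):

```shell
# Placeholder: replace with the real fencing agent for your hardware.
# It must return 0 only once the peer is confirmed powered off.
fence_peer() {
    echo "fence_peer: no fencing agent configured for $1" >&2
    return 1
}

# takeover PEER DEVICE MOUNTPOINT
takeover() {
    peer="$1"; dev="$2"; mnt="$3"

    # 1. Fence first.  If the peer cannot be confirmed dead, stop here
    #    rather than risk mounting a filesystem it may still be writing.
    if ! fence_peer "$peer"; then
        echo "fencing $peer failed, refusing to mount $dev" >&2
        return 1
    fi

    # 2. Only now mount (this is where ext3 replays the journal).
    mount -t ext3 "$dev" "$mnt" || return 1
    echo "mounted $dev on $mnt"
}
```

The point is only the order: the mount is never attempted unless the
fencing step reported success.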

> 
> Ste
> 
> 
> 2009/5/9 Christian Kujau <lists at nerdbynature.de>
> 
> > On Thu, 7 May 2009, Stefano Cislaghi wrote:
> > > During a normal switch, operations done are:
> > > - oracle shutdown abort
> > > - oracle listener shutdown
> > > - umount fs (using  umount -l )

Using "umount -l" does NOT actually unmount the filesystem while
some process is keeping it busy.  All it does is hide the mountpoint
until the busy process goes away.  Needing this for a failover is
definitely a bad sign.

Try "lsof" to see which processes are keeping the mountpoint busy.
At minimum, those processes need to be stopped or killed before
doing a proper unmount.
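
Something along these lines could replace the "umount -l" step.  It
is only a sketch: clean_umount is a name I made up, and the
mountpoint is whatever your Oracle data filesystem uses.

```shell
# clean_umount MOUNTPOINT
# Succeeds only if the filesystem was really unmounted.
clean_umount() {
    mnt="$1"

    # Given a mountpoint, lsof lists the files open on that filesystem
    # and exits 0 if it found any.
    if lsof "$mnt" >/dev/null 2>&1; then
        echo "busy: processes still using $mnt" >&2
        lsof "$mnt" >&2     # show the offenders so they can be stopped
        return 1
    fi

    # Plain umount, no -l: it fails loudly if the filesystem is still
    # busy instead of hiding the problem.
    umount "$mnt" || return 1
    echo "unmounted $mnt"
}
```

If this returns non-zero the failover script should stop (or stop the
listed processes and retry), not carry on and let the other node
mount the device.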

> > I'm not all too Oracle cluster savvy, but this lazy umount looks
> > kinda suspicious. From the manpage:
> >
> >   > Detach the filesystem from the filesystem hierarchy now, and cleanup
> >   > all references to the filesystem as soon as it is not busy anymore
> >
> > My wild guess: node1 has been shut down, did a lazy umount, so that
> > node2 could mount it but node1 was still writing to the fs (i.e. it was
> > still in use)?
> >
> > Christian.
> > --
> > Bruce Schneier's first program was encrypt world.
> >


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.