[Linux-cluster] error messages while use fsck.gfs2

Wed Nov 16 12:14:36 UTC 2011

Hi,

On Wed, 2011-11-16 at 12:00 +0000, Alan Brown wrote:
> On Wed, 16 Nov 2011, Steven Whitehouse wrote:
> 
> > The problem is the blocks following that, such as the master directory
> > which contains all the system files. If enough of that has been
> > destroyed, it would make it very tricky to reconstruct. Even so it might
> > be possible depending on exactly which blocks are damaged and what is
> > known about the original fs.
> 
> Why can't this be mirrored at the end of the partition/fs?
> 
Because some of those items are updated during normal fs operation and
it would dramatically reduce performance if we had to update multiple
places on disk.

This is no different to any other filesystem. There is an argument for
having multiple superblock copies, which probably wouldn't be too tricky
to add, but we've not bothered so far simply because the superblock is
generally very easy to reconstruct.

The superblock contains the pointers to the root and master inodes, plus
the fs label and uuid, and that's basically it.
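
To give a feel for how little is in there, here is a rough Python sketch
that unpacks those fields from a raw superblock buffer. The field layout,
the magic number and the 64KiB offset are written from memory of
gfs2_ondisk.h, so treat them as assumptions to check against the header.
The function name is made up for the example, and I am assuming the "fs
label" corresponds to the lock table name field.

```python
import struct
import uuid

# ASSUMPTION: values and layout recalled from gfs2_ondisk.h, not verified here.
GFS2_MAGIC = 0x01161970
GFS2_SB_OFFSET = 128 * 512  # superblock assumed to start at basic block 128

# Big-endian, no implicit padding:
#   meta header: magic, type, pad, format, jid
#   fs_format, multihost_format, pad
#   bsize, bsize_shift, pad
#   master dir inum, unused inum, root dir inum (formal_ino, addr each)
#   lockproto[64], locktable[64]
#   two unused inums
#   uuid[16]
SB_FORMAT = ">IIQII" "III" "III" "QQ" "QQ" "QQ" "64s" "64s" "QQ" "QQ" "16s"


def parse_gfs2_sb(buf):
    """Hypothetical helper: pull the interesting fields out of a raw buffer."""
    f = struct.unpack_from(SB_FORMAT, buf)
    if f[0] != GFS2_MAGIC:
        raise ValueError("not a GFS2 superblock (bad magic)")
    return {
        "bsize": f[8],
        "master_dir_addr": f[12],  # block address of the master inode
        "root_dir_addr": f[16],    # block address of the root inode
        "lockproto": f[17].split(b"\0", 1)[0].decode(),
        "locktable": f[18].split(b"\0", 1)[0].decode(),
        "uuid": uuid.UUID(bytes=f[23]),
    }
```

To try it against a device image, seek to GFS2_SB_OFFSET, read
struct.calcsize(SB_FORMAT) bytes and pass them to parse_gfs2_sb(). Do that
read-only, and preferably on a copy.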

> > The real question is how those blocks became overwritten in the first
> > place. However, if there is some other process which has overwritten
> > part of the disk there is very little that the fs can do,
> 
> It's most likely to be something external that's done it, but IMO critical
> metadata really should be duplicated elsewhere on the FS to aid recovery.
> 
> AB
> 
We cannot reasonably guard against other processes doing something they
ought not to, directly to the device, by duplicating metadata. This is
why processes have permissions associated with them - they should not
have access to the device if they are not trusted. If the issue is
device reliability, then that should be taken care of at the device
level, using RAID.

Nor can we reasonably guard against everything a sysadmin does (I'm not
saying that this was the case here, but something has gone wrong and it
doesn't look as though it happened via the filesystem).

Having a backup of the system is the only real solution in this case,

Steve.