[Linux-cluster] Re: GFS on md on shared disks?
Ken Preslan
kpreslan at redhat.com
Thu Oct 7 18:18:09 UTC 2004
On Thu, Oct 07, 2004 at 12:07:57PM -0400, Ed L Cashin wrote:
> Erling Nygaard <nygaard at redhat.com> writes:
>
> > No, this will not work at all.
> >
> > All GFS locking is done on a filesystem level. In order to make this work
> > you need locking on the block-device level.
>
> I guess I'm looking for a concrete reason why it won't work. I've
> been assuming it won't work, but I can't think of a concrete reason.
The reason non-cluster-aware software RAID5 won't work is that the
parity blocks aren't locked correctly. Take a case where there
are 3 disks and look at the contents of one stripe:
Disk0 Disk1 Disk2
+-----------+-----------+-----------+
.... | | | |
Stripe 12 | inode #23 | inode #24 | parity |
.... | | | |
+-----------+-----------+-----------+
Suppose Node A writes inode 23 and Node B writes inode 24 (both at the
same time). The following sequence of events could occur:
1) Node A locks inode 23 exclusively
2) Node B locks inode 24 exclusively
3) Node A starts writing inode 23. This consists of:
A) Reading the inode off of Disk 0
B) Reading the parity block off of Disk 2
C) XORing the old version of the Disk 0 block out of the Disk 2 block
D) XORing the new version of the Disk 0 block into the Disk 2 block
4) Node B starts writing inode 24. This consists of:
A) Reading the inode off of Disk 1
B) Reading the parity block off of Disk 2
C) XORing the old version of the Disk 1 block out of the Disk 2 block
D) XORing the new version of the Disk 1 block into the Disk 2 block
5) Node A completes writing inode 23. This consists of:
   A) Writing the new block to Disk 0
   B) Writing the new parity block to Disk 2
6) Node B completes writing inode 24. This consists of:
   A) Writing the new block to Disk 1
   B) Writing the new parity block to Disk 2
The problem is that there were two simultaneous read-modify-write
operations on the parity block, and neither operation took the other
into account. So, the data in the non-parity blocks is correct, but the
parity block is now corrupt. As long as you don't lose a disk, you're
fine. But, as soon as a disk dies, the values you'll get from
reconstructing inode 23 or 24 will be completely bogus.
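The lost update is easy to see in miniature. Here's a sketch (my own
illustration, not GFS or md code) where single bytes stand in for disk
blocks and the two nodes interleave exactly as in steps 3-6 above:

```python
# Single bytes stand in for whole disk blocks (illustration only).
d0, d1 = 0x23, 0x24          # "inode 23" on Disk 0, "inode 24" on Disk 1
parity = d0 ^ d1             # parity block on Disk 2

new_d0, new_d1 = 0xA0, 0xB1  # the new contents each node wants to write

# Step 3: Node A's read-modify-write of the parity block
parity_a = parity ^ d0 ^ new_d0   # XOR the old data out, the new data in

# Step 4: Node B reads the *same* stale parity, unaware of Node A
parity_b = parity ^ d1 ^ new_d1

# Steps 5-6: both nodes write; Node B's parity write lands last
d0, d1 = new_d0, new_d1
parity = parity_b

# The data blocks are fine, but parity no longer covers Node A's write:
assert parity != d0 ^ d1

# If Disk 0 now dies, rebuilding "inode 23" from parity gives garbage:
reconstructed_d0 = parity ^ d1
assert reconstructed_d0 != new_d0
```

The same thing happens symmetrically if Node A's parity write lands
last: then parity fails to cover Node B's data instead.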
A cluster-aware software RAID5 implementation would lock stripes so that
only one machine could modify a given stripe at a time.
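As a sketch of that fix (again my own illustration: a local mutex stands
in for what would really be a cluster-wide lock, e.g. from a DLM), the
whole read-modify-write cycle is held under a per-stripe lock:

```python
import threading

# One lock per stripe; a real cluster RAID would take a DLM lock here.
stripe_lock = threading.Lock()
stripe = {"d0": 0x23, "d1": 0x24}
stripe["parity"] = stripe["d0"] ^ stripe["d1"]

def write_block(name, new_value):
    # The entire read-modify-write of data + parity is serialized,
    # so no node ever sees a stale parity block.
    with stripe_lock:
        old = stripe[name]
        stripe["parity"] ^= old ^ new_value
        stripe[name] = new_value

# Two "nodes" writing different blocks of the same stripe concurrently
t1 = threading.Thread(target=write_block, args=("d0", 0xA0))
t2 = threading.Thread(target=write_block, args=("d1", 0xB1))
t1.start(); t2.start()
t1.join(); t2.join()

# Parity stays consistent no matter which write wins the race
assert stripe["parity"] == stripe["d0"] ^ stripe["d1"]
```

The cost is that two nodes writing different inodes in the same stripe
now contend on the stripe lock, which is exactly the serialization the
non-cluster-aware md code skips.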
--
Ken Preslan <kpreslan at redhat.com>