[Linux-cluster] Re: GFS on md on shared disks?
Ken Preslan
kpreslan at redhat.com
Thu Oct 7 18:18:09 UTC 2004
On Thu, Oct 07, 2004 at 12:07:57PM -0400, Ed L Cashin wrote:
> Erling Nygaard <nygaard at redhat.com> writes:
>
> > No, this will not work at all.
> >
> > All GFS locking is done on a filesystem level. In order to make this work
> > you need locking on the block-device level.
>
> I guess I'm looking for a concrete reason why it won't work. I've
> been assuming it won't work, but I can't think of a concrete reason.
The reason non-cluster-aware software RAID5 won't work is that the
parity blocks aren't locked correctly. Take a case where there
are 3 disks and look at the contents of one stripe:
Disk0 Disk1 Disk2
+-----------+-----------+-----------+
.... | | | |
Stripe 12 | inode #23 | inode #24 | parity |
.... | | | |
+-----------+-----------+-----------+
Suppose Node A writes inode 23 and Node B writes inode 24 (both at the
same time). The following sequence of events could occur:
1) Node A locks inode 23 exclusively
2) Node B locks inode 24 exclusively
3) Node A starts writing inode 23. This consists of:
A) Reading the inode off of Disk 0
B) Reading the parity block off of Disk 2
C) XORing the old version of the Disk 0 block out of the Disk 2 block
D) XORing the new version of the Disk 0 block into the Disk 2 block
4) Node B starts writing inode 24. This consists of:
A) Reading the inode off of Disk 1
B) Reading the parity block off of Disk 2
C) XORing the old version of the Disk 1 block out of the Disk 2 block
D) XORing the new version of the Disk 1 block into the Disk 2 block
5) Node A completes writing inode 23. This consists of:
   A) Writing the new block to Disk 0
   B) Writing the new parity block to Disk 2
6) Node B completes writing inode 24. This consists of:
   A) Writing the new block to Disk 1
   B) Writing the new parity block to Disk 2
The problem is that there were two simultaneous read-modify-write
operations on the parity block, and neither operation took the other
into account. So, the data in the non-parity blocks is correct, but the
parity block is now corrupt. As long as you don't lose a disk, you're
fine. But, as soon as a disk dies, the values you'll get from
reconstructing inode 23 or 24 will be completely bogus.
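The lost update is easy to see in miniature. Here's a sketch (my own
illustration, not GFS or md code) where single bytes stand in for disk
blocks and the two nodes interleave exactly as in steps 3-6 above:

```python
# Single bytes stand in for whole disk blocks (illustration only).
d0, d1 = 0x23, 0x24          # "inode 23" on Disk 0, "inode 24" on Disk 1
parity = d0 ^ d1             # parity block on Disk 2

new_d0, new_d1 = 0xA0, 0xB1  # the new contents each node wants to write

# Step 3: Node A's read-modify-write of the parity block
parity_a = parity ^ d0 ^ new_d0   # XOR the old data out, the new data in

# Step 4: Node B reads the *same* stale parity, unaware of Node A
parity_b = parity ^ d1 ^ new_d1

# Steps 5-6: both nodes write; Node B's parity write lands last
d0, d1 = new_d0, new_d1
parity = parity_b

# The data blocks are fine, but parity no longer covers Node A's write:
assert parity != d0 ^ d1

# If Disk 0 now dies, rebuilding "inode 23" from parity gives garbage:
reconstructed_d0 = parity ^ d1
assert reconstructed_d0 != new_d0
```

The same thing happens symmetrically if Node A's parity write lands
last: then parity fails to cover Node B's data instead.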
A cluster-aware software RAID5 implementation would lock stripes so that
only one machine could modify a given stripe at a time.
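As a sketch of that fix (again my own illustration: a local mutex stands
in for what would really be a cluster-wide lock, e.g. from a DLM), the
whole read-modify-write cycle is held under a per-stripe lock:

```python
import threading

# One lock per stripe; a real cluster RAID would take a DLM lock here.
stripe_lock = threading.Lock()
stripe = {"d0": 0x23, "d1": 0x24}
stripe["parity"] = stripe["d0"] ^ stripe["d1"]

def write_block(name, new_value):
    # The entire read-modify-write of data + parity is serialized,
    # so no node ever sees a stale parity block.
    with stripe_lock:
        old = stripe[name]
        stripe["parity"] ^= old ^ new_value
        stripe[name] = new_value

# Two "nodes" writing different blocks of the same stripe concurrently
t1 = threading.Thread(target=write_block, args=("d0", 0xA0))
t2 = threading.Thread(target=write_block, args=("d1", 0xB1))
t1.start(); t2.start()
t1.join(); t2.join()

# Parity stays consistent no matter which write wins the race
assert stripe["parity"] == stripe["d0"] ^ stripe["d1"]
```

The cost is that two nodes writing different inodes in the same stripe
now contend on the stripe lock, which is exactly the serialization the
non-cluster-aware md code skips.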
--
Ken Preslan <kpreslan at redhat.com>