[Linux-cluster] Cluster with shared storage on low budget

Tue Feb 15 02:37:39 UTC 2011

On 02/14/2011 09:23 PM, Nikola Savic wrote:
>> I have an in-progress tutorial, which I would recommend as a guide only.
>> If you are interested, I will send you the link off-list.
>>
>> As for your question; No, you can read/write to the shared storage at
>> the same time without the need for iSCSI. DRBD can run in
>> "Primary/Primary[/Primary]" mode. Then you layer onto this clustered LVM
>> followed by GFS2. Once up, all three nodes can access and edit the same
>> storage space at the same time.
>>
>> So you're taking advantage of all three technologies. As for mirrored
>> LVM, I've not tried it yet as DRBD->cLVM->GFS2 has worked quite well for me.
> 
>   I just read about the Primary/Primary configuration in DRBD's User
> Guide, but would love to get the link to the tutorial you mentioned,
> especially if it covers fencing :) When one of the servers is restarted
> and there is a delay in data being written to DRBD, what happens when
> the server is back up? Is booting stopped by DRBD until synchronization
> is done, or does it try to do it in the background? If it's done in the
> background, how does Primary/Primary mode work?
> 
>   Thanks,
>   Nikola

Once the cluster manager (corosync in Cluster3, openais in Cluster2)
stops getting messages from a node (be it hung or dead), it starts a
counter. Once the counter exceeds a set threshold, the node is declared
dead and a fence is called against that node. This should, when working
properly, reliably prevent the node from trying to access the shared
storage (i.e., stop it from trying to complete a write operation).
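
To make the counter/threshold idea concrete, here is a toy sketch in
Python. It is not how corosync or openais are implemented; the class
name, the method names and the 10 second threshold are all made up for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Membership:
    """Toy failure detector: declare a node dead after a silence threshold."""
    threshold: float                      # seconds of silence tolerated (hypothetical value)
    last_seen: dict = field(default_factory=dict)

    def heard_from(self, node: str, now: float) -> None:
        # Any message from the node resets its silence counter.
        self.last_seen[node] = now

    def nodes_to_fence(self, now: float) -> list:
        # A node silent for longer than the threshold is declared dead,
        # whether it is actually dead or merely hung.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.threshold)


members = Membership(threshold=10.0)
members.heard_from("node1", now=0.0)
members.heard_from("node2", now=0.0)
members.heard_from("node1", now=8.0)      # node2 has gone quiet
print(members.nodes_to_fence(now=9.0))    # nobody has exceeded the threshold yet
print(members.nodes_to_fence(now=12.0))   # node2 has been silent for 12 seconds
```

Note that the detector cannot tell a hung node from a dead one, which is
exactly why the fence has to follow.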

Once, and *only* once, the fence has succeeded, the cluster will reform.
Once the new cluster configuration is in place, recovery of the file
system can begin (i.e., the journal can be replayed). Finally, normal
operation can continue, albeit with one fewer node. This is also when
the resource manager (rgmanager or pacemaker) starts shuffling around
any resources that were lost when the node went down.
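
The ordering matters, so here is a rough Python sketch of it. The
function and the `fence` callable are hypothetical stand-ins, not the
real cluster manager or rgmanager/pacemaker interfaces; the only point
is that nothing after the fence runs unless the fence reports success.

```python
def handle_lost_node(node, fence, log):
    """Run the recovery steps in order; stop everything if the fence fails."""
    if not fence(node):
        # No confirmation that the node is off the storage: do not proceed.
        log.append("blocked: fence of " + node + " failed")
        return False
    log.append("fenced " + node)
    log.append("cluster reformed without " + node)
    log.append("journal replayed")
    log.append("resources relocated")
    return True


log_ok = []
handle_lost_node("node3", lambda n: True, log_ok)    # fence device works
print(log_ok)

log_bad = []
handle_lost_node("node3", lambda n: False, log_bad)  # fence device broken
print(log_bad)
```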

Traditionally, fencing involves rebooting the lost node in the hope
that it will come back in a healthier state. Assuming it does come up
healthy, two main steps must occur.

First, it will rejoin the other DRBD members. These members will have a
"dirty block" list in memory, which allows them to quickly bring the
recovered server back into sync. During this time, you can bring that
node online (i.e., set it primary and start accessing it via GFS2).
However, note that it cannot be the sole primary device until it is
fully synced.
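
As a rough illustration of why the resync is quick, here is a toy model
in Python. It is not DRBD's actual data structure or API; the classes
and the block numbers are invented. The surviving node only remembers
*which* blocks changed while the peer was away, then copies just those.

```python
class Replica:
    """A node's local copy of the replicated device."""
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks


class Survivor(Replica):
    """A node that stayed up and tracks blocks its peer has missed."""
    def __init__(self, nblocks):
        super().__init__(nblocks)
        self.dirty = set()            # indexes written while the peer was gone
        self.peer_connected = True

    def write(self, index, data, peer):
        self.blocks[index] = data
        if self.peer_connected:
            peer.blocks[index] = data  # normal replication
        else:
            self.dirty.add(index)      # remember it for later

    def resync(self, peer):
        # Copy only the dirty blocks, not the whole device.
        copied = sorted(self.dirty)
        for i in copied:
            peer.blocks[i] = self.blocks[i]
        self.dirty.clear()
        self.peer_connected = True
        return copied


a, b = Survivor(8), Replica(8)
a.write(0, b"x", b)                    # replicated normally
a.peer_connected = False               # b is fenced and rebooting
a.write(3, b"y", b)
a.write(5, b"z", b)
copied = a.resync(b)                   # b is back
print(copied, a.blocks == b.blocks)
```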

Second, the cluster reforms to include the recovered node. Once the
member has successfully joined, the resource manager (again, rgmanager
or pacemaker) will begin reorganizing the clustered resources as per
your configuration.

An important note:

If the fence call fails (either because of a fault in the fence device
or due to misconfiguration), the cluster will hang and *all* access to
the shared storage will stop.

*This is by design!*

The reason is this: if the cluster falsely assumed the node was dead
and began recovering the journal, and the hung node then recovered and
tried to complete its write, the shared filesystem would be corrupted.
In short: "Better a hung cluster than a corrupt cluster."
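
A deliberately tiny Python model of that failure, with made-up block
names and values (real storage is obviously not a dict): the only
difference between the two runs is whether the hung node was fenced
before recovery started.

```python
def run(fence_first: bool) -> dict:
    storage = {"block7": "old"}        # shared storage, one block
    pending = ("block7", "stale")      # write the hung node had in flight

    if fence_first:
        pending = None                 # fenced node is powered off; the write is gone

    # Survivors replay the journal and carry on.
    storage["block7"] = "recovered"

    # The hung node wakes up and finishes whatever it still has queued.
    if pending is not None:
        block, data = pending
        storage[block] = data
    return storage


print(run(fence_first=False))   # the stale write lands on top of recovered data
print(run(fence_first=True))    # the recovered data survives
```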

This is why fencing is so critical. :)

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org