[Linux-cluster] fencing: external vs watchdog

Maciej Bogucki maciej.bogucki at artegence.com
Fri Aug 17 11:03:03 UTC 2007


Mark Hlawatschek wrote:
> Hi,
> 
> I'd like to discuss and collect information about the two different fencing 
> approaches.
> 
> external fencing: The failed cluster node is disconnected from the storage 
> device by another node in the cluster. After failure detection, all cluster 
> activities are suspended until the I/O fencing of the failed node has 
> completed successfully.
> 
> watchdog fencing: A failed cluster node has to recognize the failure by itself 
> and is shut down by an internal watchdog mechanism.
> 
> Now, I see that theoretically the external fencing method (when configured 
> correctly) is the better approach, because the cluster state is exactly 
> defined during fencing and recovery operations.
> 
> But the question is: what are real-world examples of failures where watchdog 
> fencing would fail and cause data corruption on the storage device?
> I'd like to collect real-world examples as well as theoretical scenarios.
> 
> All comments welcome!

Hello,

Watchdog fencing isn't good for at least two reasons:
1. The watchdog is a piece of code that runs in user space, so you have no
100% guarantee that it will run correctly (see the sketch after this list).
2. Watchdog fencing can't protect you against split-brain situations, where
the consequence could be corruption of your data. This is where external
fencing comes in.
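
To make point 1 concrete, here is a minimal sketch of how a userspace
watchdog works. I'm assuming the kernel softdog module; a hardware watchdog
exposes the same /dev/watchdog interface, and a real deployment uses a
proper daemon rather than a shell loop, but the principle is the same:

    # Load the software watchdog driver; the kernel reboots the machine
    # if /dev/watchdog is not written to within soft_margin seconds.
    modprobe softdog soft_margin=60

    # Pet the watchdog in a loop. If this process hangs, is killed, or
    # is starved by the scheduler, the node self-fences by rebooting.
    while true; do
        echo -n . > /dev/watchdog
        sleep 10
    done

Everything after the modprobe is ordinary userspace code, and that is
exactly the weak point: the protection is only as reliable as the process
doing the petting.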

There is another way to look at this, comparing Linux clusters with
commercial clusters (e.g. Sun Cluster). The Linux cluster stack lives in
user space, so you have no guarantee that local fencing will run correctly,
and you need external fencing to solve this fundamental problem. Sun
Cluster lives in kernel space, so when a node loses quorum it triggers a
kernel panic, and you have a 100% guarantee that this will succeed.
For me, network fencing (IPMI, DRAC, ...) isn't good, because you have to
reach the management controller over the network, and that path itself can
fail, and so on. A typical IPMI power fence looks like the sketch below.
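
For illustration only (the BMC address and credentials are hypothetical):

    # Power the failed node off via its BMC over the LAN. If the network
    # path to the BMC is down, the fence fails and the cluster stays blocked.
    ipmitool -I lanplus -H 10.0.0.12 -U admin -P secret chassis power off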
The best fencing mechanism is fence_scsi, which is an I/O fencing agent. It
can be used with SCSI devices that support persistent reservations (SPC-2
or greater). In most cases your shared storage supports SPC-2 or SPC-3.
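
As a rough sketch of what fence_scsi does under the hood, assuming a shared
LUN at /dev/sdb and the sg_persist tool from sg3_utils (the keys 0x1 and
0x2 are just examples):

    # Each node registers its own key with the device.
    sg_persist --out --register --param-sark=0x1 /dev/sdb

    # One node takes a "write exclusive, registrants only" reservation
    # (persistent reservation type 5).
    sg_persist --out --reserve --param-rk=0x1 --prout-type=5 /dev/sdb

    # Fencing = preempting the failed node's key (0x2): the device then
    # rejects its writes, no matter what state its network is in.
    sg_persist --out --preempt --param-rk=0x1 --param-sark=0x2 --prout-type=5 /dev/sdb

    # Inspect the currently registered keys.
    sg_persist --in --read-keys /dev/sdb

The nice property is that the fence acts directly at the storage, so a node
that is cut off from the network but still alive can no longer corrupt the
data.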

Best Regards
Maciej Bogucki



