[Linux-cluster] clvm hang after a node is fenced in a 2-node cluster

Jean Diallo admin1-bua.dage-etd at justice.gouv.fr
Thu Jun 4 06:54:07 UTC 2009


Description of problem: In a 2-node cluster, after one node is fenced, any
clvm command hangs on the remaining node. When the fenced node comes back
into the cluster, any clvm command hangs there as well; moreover, the node
does not activate any clustered VG, and so cannot access any shared device.
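
For reference, this is the kind of snapshot that can be taken on the remaining
node while the hang is in progress; all of these are stock RHEL 5 cluster
commands, nothing custom:

cman_tool services        # fence, dlm/clvmd and rgmanager groups and their members
group_tool                # same view, output pasted further down
ps -ef | grep clvmd       # check whether clvmd is still running
vgs                       # one of the commands that hangs at this point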


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 5.2, updated with:
       device-mapper-1.02.28-2.el5.x86_64.rpm
       lvm2-2.02.40-6.el5.x86_64.rpm
       lvm2-cluster-2.02.40-7.el5.x86_64.rpm


Steps to Reproduce:
1. 2-node cluster, quorum formed with qdisk
2. Cold boot node 2
3. Node 2 is evicted and fenced; services are taken over by node 1
4. Node 2 comes back into the cluster, quorate, but no clustered VG is activated and any lvm-related command hangs (see the membership check just below)
5. At this point every lvm command hangs on node 1 as well
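
At step 4 the membership itself is fine (the node is quorate); roughly, the
checks that show it on node 2 before touching LVM (hostnames and addresses
are mine):

cman_tool status          # reports Quorate (expected votes 3, qdisk included)
cman_tool nodes           # both nodes listed as members
clustat                   # rgmanager view of members and services
lvscan                    # hangs, and no clustered LV is active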
 

Expected results: node 2 should be able to reacquire its locks on the
clustered LVM volumes, and node 1 should be able to issue any lvm-related
command.
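
In other words, once node 2 has rejoined, a sequence like the following should
go through (a sketch using one of my VG names; any clustered VG would do):

service clvmd start                 # rejoin the clvmd/DLM lockspace
vgchange -aly VGvmalfrescoP64       # activate the clustered VG on this node
lvs VGvmalfrescoP64                 # list its logical volumes

and on node 1 a plain vgs or lvs should return instead of hanging.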

Here are my cluster.conf and lvm.conf:
<?xml version="1.0"?>
<cluster alias="rome" config_version="53" name="rome">
        <fence_daemon clean_start="0" post_fail_delay="9" post_join_delay="6"/>
        <clusternodes>
                <clusternode name="romulus.fr" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ilo172"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="remus.fr" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ilo173"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3"/>
        <totem consensus="4800" join="60" token="21002" token_retransmits_before_loss_const="20"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="X.X.X.X" login="Administrator" name="ilo172" passwd="X.X.X.X"/>
                <fencedevice agent="fence_ilo" hostname="XXXX" login="Administrator" name="ilo173" passwd="XXXX"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
                <vm autostart="1" exclusive="0" migrate="live" name="alfrescoP64" path="/etc/xen" recovery="relocate"/>
                <vm autostart="1" exclusive="0" migrate="live" name="alfrescoI64" path="/etc/xen" recovery="relocate"/>
                <vm autostart="1" exclusive="0" migrate="live" name="alfrescoS64" path="/etc/xen" recovery="relocate"/>
        </rm>
        <quorumd interval="3" label="quorum64" min_score="1" tko="30" votes="1">
                <heuristic interval="2" program="ping -c3 -t2 X.X.X.X" score="1"/>
        </quorumd>
</cluster>
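
For what it is worth, one can also confirm that both nodes agree on the
configuration version before each test (a quick sanity check, not part of the
reproduction):

cman_tool version                    # prints the cluster and config versions in use
cman_tool status | grep -i version   # same information, from the status output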

Part of lvm.conf:
    # Type 3 uses built-in clustered locking.
    locking_type = 3

    # If using external locking (type 2) and initialisation fails,
    # with this set to 1 an attempt will be made to use the built-in
    # clustered locking.
    # If you are using a customised locking_library you should set this to 0.
    fallback_to_clustered_locking = 0

    # If an attempt to initialise type 2 or type 3 locking failed, perhaps
    # because cluster components such as clvmd are not running, with this set
    # to 1 an attempt will be made to use local file-based locking (type 1).
    # If this succeeds, only commands against local volume groups will proceed.
    # Volume Groups marked as clustered will be ignored.
    fallback_to_local_locking = 1

    # Local non-LV directory that holds file-based locks while commands are
    # in progress.  A directory like /tmp that may get wiped on reboot is OK.
    locking_dir = "/var/lock/lvm"

    # Other entries can go here to allow you to load shared libraries
    # e.g. if support for LVM1 metadata was compiled as a shared library use
    #   format_libraries = "liblvm2format1.so"
    # Full pathnames can be given.

    # Search this directory first for shared libraries.
    #   library_dir = "/lib"

    # The external locking library to load if locking_type is set to 2.
    #   locking_library = "liblvm2clusterlock.so"
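
For reference, the usual way to end up with this locking configuration on
RHEL 5 is the stock lvmconf helper from lvm2-cluster; the VGs here are already
marked clustered:

lvmconf --enable-cluster             # sets locking_type = 3 in /etc/lvm/lvm.conf
service clvmd restart                # restart clvmd so it picks up the change
vgchange -cy VGvmalfrescoP64         # mark a VG as clustered (mine already are)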


Part of the lvm log on the second node:

vgchange.c:165   Activated logical volumes in volume group "VolGroup00"
vgchange.c:172   7 logical volume(s) in volume group "VolGroup00" now active
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:17:29 2009
commands/toolcontext.c:209   Set umask to 0077
locking/cluster_locking.c:83   connect() failed on local socket: Connection refused
locking/locking.c:259   WARNING: Falling back to local file-based locking.
locking/locking.c:261   Volume Groups with the clustered attribute will be inaccessible.
toollib.c:578   Finding all volume groups
toollib.c:491   Finding volume group "VGhomealfrescoS64"
metadata/metadata.c:2379   Skipping clustered volume group VGhomealfrescoS64
toollib.c:491   Finding volume group "VGhomealfS64"
metadata/metadata.c:2379   Skipping clustered volume group VGhomealfS64
toollib.c:491   Finding volume group "VGvmalfrescoS64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoS64
toollib.c:491   Finding volume group "VGvmalfrescoI64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoI64
toollib.c:491   Finding volume group "VGvmalfrescoP64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoP64
toollib.c:491   Finding volume group "VolGroup00"
libdm-report.c:981   VolGroup00
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:17:29 2009
commands/toolcontext.c:209   Set umask to 0077
locking/cluster_locking.c:83   connect() failed on local socket: Connection refused
locking/locking.c:259   WARNING: Falling back to local file-based locking.
locking/locking.c:261   Volume Groups with the clustered attribute will be inaccessible.
toollib.c:542   Using volume group(s) on command line
toollib.c:491   Finding volume group "VolGroup00"
vgchange.c:117   7 logical volume(s) in volume group "VolGroup00" monitored
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:20:45 2009
commands/toolcontext.c:209   Set umask to 0077
toollib.c:331   Finding all logical volumes
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:20:50 2009
commands/toolcontext.c:209   Set umask to 0077
toollib.c:578   Finding all volume groups
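
The "connect() failed on local socket: Connection refused" lines above suggest
clvmd was not reachable when these commands ran during boot, which would
explain the fallback to local locking and the skipped clustered VGs. A quick
way to check the service setup on node 2 (standard init scripts assumed):

chkconfig --list cman
chkconfig --list clvmd               # both should be on for the runlevels in use
service clvmd status                 # is clvmd actually running now?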


group_tool on node 1
type             level name       id       state       
fence            0     default    00010001 none        
[1 2]
dlm              1     clvmd      00010002 none        
[1 2]
dlm              1     rgmanager  00020002 none        
[1]


group_tool on node 2
[root@remus ~]# group_tool
type             level name     id       state       
fence            0     default  00010001 none        
[1 2]
dlm              1     clvmd    00010002 none        
[1 2]

Additional info:



