[Linux-cluster] Fwd: Re: problems with clvmd and lvms on rhel6.1

Poós Krisztián krisztian at poos.hu
Fri Aug 10 18:27:48 UTC 2012


So let's forget the test environment in this case.
Here is the real environment, which is not yet fully in production, so I
can still run tests on it.
Fencing (SCSI-3 persistent reservation) is configured and has been tested;
it works. I configured the cluster to use it, but the LVs are still down
and the cluster is not able to mount the filesystem. However, I can mount
it manually, and the clustered LV activation flags look fine: -a- on one
node and --- on the other. The logs, outputs, and config follow below.
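
Side note on the fencing test: the reservation state on the shared LUN can
be double-checked with sg_persist (just a sketch; /dev/mapper/mpathX is a
placeholder for one of the shared PVs):

  sg_persist --in --read-keys --device=/dev/mapper/mpathX          # list the registered keys
  sg_persist --in --read-reservation --device=/dev/mapper/mpathX   # show the current reservation holder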

[root@linuxsap2 cluster]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="linuxsap-c">
        <clusternodes>
                <clusternode name="linuxsap1-priv" nodeid="1">
                        <fence>
                                <method name="Method">
                                        <device name="fence_dev"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="fence_dev"/>
                        </unfence>
                </clusternode>
                <clusternode name="linuxsap2-priv" nodeid="2">
                        <fence>
                                <method name="Method">
                                        <device name="fence_dev"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="fence_dev"/>
                        </unfence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3"/>
        <quorumd label="qdisk_dev"/>
        <rm>
                <failoverdomains>
                        <failoverdomain name="FOD-Teszt" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="linuxsap1-priv" priority="1"/>
                                <failoverdomainnode name="linuxsap2-priv" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <lvm name="vg_PRD_oracle" vg_name="vg_PRD_oracle"/>
                        <fs device="/dev/vg_PRD_oracle/lv_PRD_orabin" fsid="32283" fstype="ext4" mountpoint="/oracle/PRD" name="PRD_orabin"/>
                </resources>
                <service autostart="0" domain="FOD-Teszt" name="FS_teszt" recovery="disable">
                        <lvm ref="vg_PRD_oracle"/>
                        <fs ref="PRD_oralog1"/>
                </service>
        </rm>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="fence_dev"/>
        </fencedevices>
</cluster>
[root@linuxsap2 cluster]#
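
As a side note, the config itself can be sanity-checked against the cluster
schema (a quick check, assuming the stock RHEL 6 cluster tools are installed):

  ccs_config_validate   # validate /etc/cluster/cluster.conf against the installed schema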

Aug 10 20:10:07 linuxsap1 rgmanager[9680]: Service service:FS_teszt is recovering
Aug 10 20:10:07 linuxsap1 rgmanager[9680]: #71: Relocating failed service service:FS_teszt
Aug 10 20:10:08 linuxsap1 rgmanager[9680]: Service service:FS_teszt is stopped
Aug 10 20:11:21 linuxsap1 rgmanager[9680]: Starting stopped service service:FS_teszt
Aug 10 20:11:21 linuxsap1 rgmanager[10777]: [lvm] Starting volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10801]: [lvm] Failed to activate volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10823]: [lvm] Attempting cleanup of vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[10849]: [lvm] Failed second attempt to activate vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: start on lvm "vg_PRD_oracle" returned 1 (generic error)
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: #68: Failed to start service:FS_teszt; return value: 1
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: Stopping service service:FS_teszt

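For more detail than the syslog lines above give, the service start can also
be driven by hand with rg_test (sketch only; I have not pasted its output here):

  rg_test test /etc/cluster/cluster.conf start service FS_teszt   # run the service start path outside rgmanager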

[root@linuxsap1 cluster]# lvs | grep PRD
  lv_PRD_oraarch  vg_PRD_oracle        -wi-a---  30.00g
  lv_PRD_orabin   vg_PRD_oracle        -wi-a---  10.00g
  lv_PRD_oralog1  vg_PRD_oracle        -wi-a---   1.00g
  lv_PRD_oralog2  vg_PRD_oracle        -wi-a---   1.00g
  lv_PRD_sapdata1 vg_PRD_oracle        -wi-a--- 408.00g
  lv_PRD_sapmnt   vg_PRD_sapmnt        -wi-a---  10.00g
  lv_PRD_trans    vg_PRD_trans         -wi-a---  40.00g
  lv_PRD_usrsap   vg_PRD_usrsap        -wi-a---   9.00g

[root@linuxsap2 cluster]# lvs | grep PRD
  lv_PRD_oraarch  vg_PRD_oracle        -wi-----  30.00g
  lv_PRD_orabin   vg_PRD_oracle        -wi-----  10.00g
  lv_PRD_oralog1  vg_PRD_oracle        -wi-----   1.00g
  lv_PRD_oralog2  vg_PRD_oracle        -wi-----   1.00g
  lv_PRD_sapdata1 vg_PRD_oracle        -wi----- 408.00g
  lv_PRD_sapmnt   vg_PRD_sapmnt        -wi-a---  10.00g
  lv_PRD_trans    vg_PRD_trans         -wi-a---  40.00g
  lv_PRD_usrsap   vg_PRD_usrsap        -wi-a---   9.00g

[root@linuxsap1 cluster]# mount /dev/vg_PRD_oracle/lv_PRD_orabin /oracle/PRD

[root@linuxsap1 cluster]# df -k /oracle/PRD/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_PRD_oracle-lv_PRD_orabin
                      10321208   4753336   5043584  49% /oracle/PRD
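
The mount above uses a plain activation. Since vg_PRD_oracle carries the
clustered flag, it may also be worth comparing what exclusive activation
through clvmd does; a minimal sketch, reusing the names from above:

  vgs -o vg_name,vg_attr vg_PRD_oracle        # a 'c' in the last vg_attr position marks the VG as clustered
  vgchange -aey vg_PRD_oracle                 # exclusive activation of the whole VG via clvmd
  lvchange -aey vg_PRD_oracle/lv_PRD_orabin   # or of a single LV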




On 08/10/2012 06:46 PM, Digimer wrote:
> Not sure if it relates, but I can say that without fencing, things will
> break in strange ways. The reason is that if anything triggers a fault,
> the cluster blocks by design and stays blocked until a fence call
> succeeds (which is impossible without fencing configured in the first
> place).
> 
> Can you please set up fencing and test to make sure it works (using
> 'fence_node rhel2.local' from rhel1.local, then in reverse)? Once this
> is done, test again for your problem. If it still exists, please paste
> the updated cluster.conf then. Also please include syslog from both
> nodes around the time of your LVM tests.
> 
> digimer
> 
> On 08/10/2012 12:38 PM, Poós Krisztián wrote:
>> This is the cluster.conf, which is a clone of the problematic system in
>> a test environment (without the Oracle and SAP instances, focusing only
>> on this LVM issue, with an LVM resource):
>>
>> [root@rhel2 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="teszt">
>>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>     <clusternodes>
>>         <clusternode name="rhel1.local" nodeid="1" votes="1">
>>             <fence/>
>>         </clusternode>
>>         <clusternode name="rhel2.local" nodeid="2" votes="1">
>>             <fence/>
>>         </clusternode>
>>     </clusternodes>
>>     <cman expected_votes="3"/>
>>     <fencedevices/>
>>     <rm>
>>         <failoverdomains>
>>             <failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
>>                 <failoverdomainnode name="rhel1.local" priority="1"/>
>>                 <failoverdomainnode name="rhel2.local" priority="2"/>
>>             </failoverdomain>
>>         </failoverdomains>
>>         <resources>
>>             <lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
>>             <fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4" mountpoint="/lvm" name="teszt-fs"/>
>>         </resources>
>>         <service autostart="1" domain="all" exclusive="0" name="teszt" recovery="disable">
>>             <lvm ref="teszt-lv"/>
>>             <fs ref="teszt-fs"/>
>>         </service>
>>     </rm>
>>     <quorumd label="qdisk"/>
>> </cluster>
>>
>> Here are the log parts:
>> Aug 10 17:21:21 rgmanager I am node #2
>> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
>> Aug 10 17:21:22 rgmanager Loading Service Data
>> Aug 10 17:21:29 rgmanager Initializing Services
>> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
>> Aug 10 17:21:31 rgmanager Services Initialized
>> Aug 10 17:21:31 rgmanager State change: Local UP
>> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
>> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
>> Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:23:29 rgmanager Stopping service service:teszt
>> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop; intervention required
>> Aug 10 17:23:31 rgmanager Service service:teszt is failed
>> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not start.
>> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
>> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
>> Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:25:18 rgmanager Stopping service service:teszt
>> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>>
>>
>> After I manually started the LV on node1 and tried to switch the service
>> over to node2, it was not able to start it there.
>>
>> Regards,
>> Krisztian
>>
>>
>> On 08/10/2012 05:15 PM, Digimer wrote:
>>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>>> Dear all,
>>>>
>>>> I hope someone has run into this problem in the past and can help
>>>> me resolve this issue.
>>>>
>>>> There is a 2-node RHEL cluster with a quorum disk as well.
>>>> There are clustered LVs, with the clustered (c) flag set.
>>>> If I start clvmd, all the clustered LVs become active.
>>>>
>>>> After this, if I start rgmanager, it deactivates all the volumes and is
>>>> then not able to activate them again, as the devices are no longer there
>>>> during the startup of the service, so the service fails.
>>>> All LVs remain without the active flag.
>>>>
>>>> I can bring it up manually, but only if, after clvmd has started, I first
>>>> deactivate the LVs by hand with lvchange -an <lv>.
>>>> After that, when I start rgmanager, it can take the service online without
>>>> problems. However, I think this step should be done by rgmanager
>>>> itself. The logs are full of the following:
>>>> rgmanager Making resilient: lvchange -an ....
>>>> rgmanager lv_exec_resilient failed
>>>> rgmanager lv_activate_resilient stop failed on ....
>>>>
>>>> In addition, the lvs/clvmd commands sometimes hang, and I have to
>>>> restart clvmd (sometimes kill it) to make them work again.
>>>>
>>>> Anyone has any idea, what to check?
>>>>
>>>> Thanks and regards,
>>>> Krisztian
>>>
>>> Please paste your cluster.conf file with minimal edits.
> 
> 





