[Linux-cluster] Fwd: Re: problems with clvmd and lvms on rhel6.1
Poós Krisztián
krisztian at poos.hu
Fri Aug 10 18:27:48 UTC 2012
So let's forget the test environment in this case. Here is the normal
environment, which is not fully productive yet, so I can run tests on it.
Fencing (SCSI-3 persistent reservation) works and has been tested. I
configured the cluster to use it, but the LVs are still down: the cluster
is not able to mount the filesystem. However, I can mount it manually, and
the clustered LV active flags look OK, -a- on one node and --- on the
other node. Here are the logs, outputs, and the config:
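(The active flag mentioned above is the fifth character of the lv_attr
column in 'lvs' output. A minimal sketch for checking that bit from a
script; the helper name lv_is_active is my own illustration, not an LVM
tool:)

```shell
# Report whether an LV is active from its lv_attr string.
# Character 5 of lv_attr is 'a' when the LV is active, '-' when not.
lv_is_active() {
  attr="$1"
  [ "$(printf '%s' "$attr" | cut -c5)" = "a" ]
}

# Sample attr strings as seen in the 'lvs' output below:
lv_is_active "-wi-a---" && echo "node1 LV: active"
lv_is_active "-wi-----" || echo "node2 LV: inactive"
```

(In real use the attr string would come from something like
'lvs --noheadings -o lv_attr vg_PRD_oracle/lv_PRD_orabin'.)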
[root@linuxsap2 cluster]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="linuxsap-c">
    <clusternodes>
        <clusternode name="linuxsap1-priv" nodeid="1">
            <fence>
                <method name="Method">
                    <device name="fence_dev"/>
                </method>
            </fence>
            <unfence>
                <device action="on" name="fence_dev"/>
            </unfence>
        </clusternode>
        <clusternode name="linuxsap2-priv" nodeid="2">
            <fence>
                <method name="Method">
                    <device name="fence_dev"/>
                </method>
            </fence>
            <unfence>
                <device action="on" name="fence_dev"/>
            </unfence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3"/>
    <quorumd label="qdisk_dev"/>
    <rm>
        <failoverdomains>
            <failoverdomain name="FOD-Teszt" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="linuxsap1-priv" priority="1"/>
                <failoverdomainnode name="linuxsap2-priv" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <lvm name="vg_PRD_oracle" vg_name="vg_PRD_oracle"/>
            <fs device="/dev/vg_PRD_oracle/lv_PRD_orabin" fsid="32283" fstype="ext4" mountpoint="/oracle/PRD" name="PRD_orabin"/>
        </resources>
        <service autostart="0" domain="FOD-Teszt" name="FS_teszt" recovery="disable">
            <lvm ref="vg_PRD_oracle"/>
            <fs ref="PRD_oralog1"/>
        </service>
    </rm>
    <fencedevices>
        <fencedevice agent="fence_scsi" name="fence_dev"/>
    </fencedevices>
</cluster>
[root@linuxsap2 cluster]#
Aug 10 20:10:07 linuxsap1 rgmanager[9680]: Service service:FS_teszt is recovering
Aug 10 20:10:07 linuxsap1 rgmanager[9680]: #71: Relocating failed service service:FS_teszt
Aug 10 20:10:08 linuxsap1 rgmanager[9680]: Service service:FS_teszt is stopped
Aug 10 20:11:21 linuxsap1 rgmanager[9680]: Starting stopped service service:FS_teszt
Aug 10 20:11:21 linuxsap1 rgmanager[10777]: [lvm] Starting volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10801]: [lvm] Failed to activate volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10823]: [lvm] Attempting cleanup of vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[10849]: [lvm] Failed second attempt to activate vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: start on lvm "vg_PRD_oracle" returned 1 (generic error)
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: #68: Failed to start service:FS_teszt; return value: 1
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: Stopping service service:FS_teszt
[root@linuxsap1 cluster]# lvs | grep PRD
lv_PRD_oraarch vg_PRD_oracle -wi-a--- 30.00g
lv_PRD_orabin vg_PRD_oracle -wi-a--- 10.00g
lv_PRD_oralog1 vg_PRD_oracle -wi-a--- 1.00g
lv_PRD_oralog2 vg_PRD_oracle -wi-a--- 1.00g
lv_PRD_sapdata1 vg_PRD_oracle -wi-a--- 408.00g
lv_PRD_sapmnt vg_PRD_sapmnt -wi-a--- 10.00g
lv_PRD_trans vg_PRD_trans -wi-a--- 40.00g
lv_PRD_usrsap vg_PRD_usrsap -wi-a--- 9.00g
[root@linuxsap2 cluster]# lvs | grep PRD
lv_PRD_oraarch vg_PRD_oracle -wi----- 30.00g
lv_PRD_orabin vg_PRD_oracle -wi----- 10.00g
lv_PRD_oralog1 vg_PRD_oracle -wi----- 1.00g
lv_PRD_oralog2 vg_PRD_oracle -wi----- 1.00g
lv_PRD_sapdata1 vg_PRD_oracle -wi----- 408.00g
lv_PRD_sapmnt vg_PRD_sapmnt -wi-a--- 10.00g
lv_PRD_trans vg_PRD_trans -wi-a--- 40.00g
lv_PRD_usrsap vg_PRD_usrsap -wi-a--- 9.00g
[root@linuxsap1 cluster]# mount /dev/vg_PRD_oracle/lv_PRD_orabin /oracle/PRD
[root@linuxsap1 cluster]# df -k /oracle/PRD/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg_PRD_oracle-lv_PRD_orabin
10321208 4753336 5043584 49% /oracle/PRD
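(For what it's worth, rgmanager's lvm resource agent behaves differently
depending on whether LVM is set up for clvmd, locking_type = 3, or for
tag-based HA-LVM, locking_type = 1 plus a volume_list entry, and a
mismatch can produce exactly this kind of activation failure. A small
sketch for pulling the configured locking_type out of an lvm.conf-style
file; get_locking_type and the sample file are my own illustration, not
an LVM command:)

```shell
# Print the locking_type value from an lvm.conf-style file.
get_locking_type() {
  sed -n 's/^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Exercise it against a minimal sample file (stand-in for /etc/lvm/lvm.conf):
cat > /tmp/lvm-sample.conf <<'EOF'
global {
    locking_type = 3
}
EOF
get_locking_type /tmp/lvm-sample.conf
```

(Running it against the real /etc/lvm/lvm.conf on both nodes would show
which locking scheme the agent is actually working with.)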
On 08/10/2012 06:46 PM, Digimer wrote:
> Not sure if it relates, but I can say that without fencing, things will
> break in strange ways. The reason is that if anything triggers a fault,
> the cluster blocks by design and stays blocked until a fence call
> succeeds (which is impossible without fencing configured in the first
> place).
>
> Can you please set up fencing and test to make sure it works (using
> 'fence_node rhel2.local' from rhel1.local, then the reverse)? Once this
> is done, test again for your problem. If it still exists, please paste
> the updated cluster.conf. Also please include syslog from both nodes
> around the time of your LVM tests.
>
> digimer
>
> On 08/10/2012 12:38 PM, Poós Krisztián wrote:
>> This is the cluster conf, which is a clone of the problematic system in
>> a test environment (without the Oracle and SAP instances, focusing only
>> on this LVM issue, with an LVM resource):
>>
>> [root@rhel2 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="teszt">
>>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>     <clusternodes>
>>         <clusternode name="rhel1.local" nodeid="1" votes="1">
>>             <fence/>
>>         </clusternode>
>>         <clusternode name="rhel2.local" nodeid="2" votes="1">
>>             <fence/>
>>         </clusternode>
>>     </clusternodes>
>>     <cman expected_votes="3"/>
>>     <fencedevices/>
>>     <rm>
>>         <failoverdomains>
>>             <failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
>>                 <failoverdomainnode name="rhel1.local" priority="1"/>
>>                 <failoverdomainnode name="rhel2.local" priority="2"/>
>>             </failoverdomain>
>>         </failoverdomains>
>>         <resources>
>>             <lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
>>             <fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4" mountpoint="/lvm" name="teszt-fs"/>
>>         </resources>
>>         <service autostart="1" domain="all" exclusive="0" name="teszt" recovery="disable">
>>             <lvm ref="teszt-lv"/>
>>             <fs ref="teszt-fs"/>
>>         </service>
>>     </rm>
>>     <quorumd label="qdisk"/>
>> </cluster>
>>
>> Here are the log parts:
>> Aug 10 17:21:21 rgmanager I am node #2
>> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
>> Aug 10 17:21:22 rgmanager Loading Service Data
>> Aug 10 17:21:29 rgmanager Initializing Services
>> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
>> Aug 10 17:21:31 rgmanager Services Initialized
>> Aug 10 17:21:31 rgmanager State change: Local UP
>> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
>> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
>> Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:23:29 rgmanager Stopping service service:teszt
>> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop; intervention required
>> Aug 10 17:23:31 rgmanager Service service:teszt is failed
>> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not start.
>> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
>> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
>> Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:25:18 rgmanager Stopping service service:teszt
>> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>>
>>
>> After I manually started the LV on node1 and then tried to switch the
>> service to node2, it was not able to start it.
>>
>> Regards,
>> Krisztian
>>
>>
>> On 08/10/2012 05:15 PM, Digimer wrote:
>>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>>> Dear all,
>>>>
>>>> I hope someone has run into this problem in the past and can maybe
>>>> help me resolve this issue.
>>>>
>>>> There is a 2-node RHEL cluster, with a quorum disk as well.
>>>> There are clustered LVs, which have the -c- (clustered) flag set.
>>>> If I start clvmd, all the clustered LVs come online.
>>>>
>>>> After this, if I start rgmanager, it deactivates all the volumes and
>>>> is not able to activate them anymore, as there are no such devices
>>>> anymore during the startup of the service, so the service fails.
>>>> All LVs remain without the active flag.
>>>>
>>>> I can bring it up manually, but only if, after clvmd has started, I
>>>> first deactivate the LVs manually with lvchange -an <lv>.
>>>> After this, when I start rgmanager, it can take the service online
>>>> without problems. However, I think this step should be done by
>>>> rgmanager itself. The logs are full of the following:
>>>> rgmanager Making resilient: lvchange -an ....
>>>> rgmanager lv_exec_resilient failed
>>>> rgmanager lv_activate_resilient stop failed on ....
>>>>
>>>> Also, the lvs/clvmd commands themselves sometimes hang, and I have to
>>>> restart clvmd (or sometimes kill it) to make them work again.
>>>>
>>>> Does anyone have any idea what to check?
>>>>
>>>> Thanks and regards,
>>>> Krisztian
>>>
>>> Please paste your cluster.conf file with minimal edits.
>
>
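(The cross-fencing test suggested earlier in the thread, fence_node
rhel2.local from rhel1.local and then the reverse, can be sketched as a
small loop. The fence_test helper is my own illustration, and the fence
command is parameterized so the loop can be tried without a live cluster:)

```shell
# Run a fence command against each listed node; every call must succeed
# before the cluster's fencing can be trusted.
fence_test() {
  cmd="$1"; shift
  for node in "$@"; do
    if "$cmd" "$node"; then
      echo "fence of $node: OK"
    else
      echo "fence of $node: FAILED"
      return 1
    fi
  done
}

# Dry run with 'true' as a stand-in command; on a real cluster this would be
#   fence_test fence_node rhel2.local   (run from rhel1.local, then reverse)
fence_test true rhel1.local rhel2.local
```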