[Linux-cluster] problems with clvmd and lvms on rhel6.1

Digimer lists at alteeve.ca
Fri Aug 10 16:46:24 UTC 2012


Not sure if it relates, but I can say that without fencing, things will 
break in strange ways. The reason is that if anything triggers a fault, 
the cluster blocks by design and stays blocked until a fence call 
succeeds (which is impossible without fencing configured in the first 
place).

Can you please set up fencing and test that it works (run 
'fence_node rhel2.local' from rhel1.local, then the reverse)? Once this 
is done, test again for your problem. If it still exists, please paste 
the updated cluster.conf then. Also please include syslog from both 
nodes around the time of your LVM tests.
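
For reference, configuring fencing in cluster.conf generally means filling in the <fencedevices> section and adding a <fence> method under each <clusternode>. The fragment below is a hypothetical sketch only, using the fence_ipmilan agent; the device names, IP addresses, and credentials are placeholders, not values from your environment:

```xml
<clusternode name="rhel1.local" nodeid="1" votes="1">
	<fence>
		<method name="ipmi">
			<device name="ipmi_rhel1"/>
		</method>
	</fence>
</clusternode>
<!-- ...same pattern for rhel2.local, referencing ipmi_rhel2... -->
<fencedevices>
	<fencedevice agent="fence_ipmilan" name="ipmi_rhel1" ipaddr="192.168.1.1" login="admin" passwd="secret" lanplus="1"/>
	<fencedevice agent="fence_ipmilan" name="ipmi_rhel2" ipaddr="192.168.1.2" login="admin" passwd="secret" lanplus="1"/>
</fencedevices>
```

Remember to increment config_version and propagate the change (e.g. with 'cman_tool version -r') before running the fence_node tests.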

digimer

On 08/10/2012 12:38 PM, Poós Krisztián wrote:
> This is the cluster.conf, which is a clone of the problematic system in
> a test environment (without the Oracle and SAP instances, focusing only
> on this LVM issue, with an LVM resource):
>
> [root at rhel2 ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster config_version="7" name="teszt">
> 	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> 	<clusternodes>
> 		<clusternode name="rhel1.local" nodeid="1" votes="1">
> 			<fence/>
> 		</clusternode>
> 		<clusternode name="rhel2.local" nodeid="2" votes="1">
> 			<fence/>
> 		</clusternode>
> 	</clusternodes>
> 	<cman expected_votes="3"/>
> 	<fencedevices/>
> 	<rm>
> 		<failoverdomains>
> 			<failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
> 				<failoverdomainnode name="rhel1.local" priority="1"/>
> 				<failoverdomainnode name="rhel2.local" priority="2"/>
> 			</failoverdomain>
> 		</failoverdomains>
> 		<resources>
> 			<lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
> 			<fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4"
> mountpoint="/lvm" name="teszt-fs"/>
> 		</resources>
> 		<service autostart="1" domain="all" exclusive="0" name="teszt"
> recovery="disable">
> 			<lvm ref="teszt-lv"/>
> 			<fs ref="teszt-fs"/>
> 		</service>
> 	</rm>
> 	<quorumd label="qdisk"/>
> </cluster>
>
> Here are the log parts:
> Aug 10 17:21:21 rgmanager I am node #2
> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
> Aug 10 17:21:22 rgmanager Loading Service Data
> Aug 10 17:21:29 rgmanager Initializing Services
> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
> Aug 10 17:21:31 rgmanager Services Initialized
> Aug 10 17:21:31 rgmanager State change: Local UP
> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
> Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
> Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return
> value: 1
> Aug 10 17:23:29 rgmanager Stopping service service:teszt
> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with
> a real device
> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid
> argument(s))
> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop;
> intervention required
> Aug 10 17:23:31 rgmanager Service service:teszt is failed
> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not
> start.
> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
> Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
> Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return
> value: 1
> Aug 10 17:25:18 rgmanager Stopping service service:teszt
> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with
> a real device
> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid
> argument(s))
>
>
> After I manually started the LV on node1 and tried to relocate the
> service to node2, it was not able to start it.
>
> Regards,
> Krisztian
>
>
> On 08/10/2012 05:15 PM, Digimer wrote:
>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>> Dear all,
>>>
>>> I hope someone has run into this problem in the past and can help
>>> me resolve this issue.
>>>
>>> There is a 2-node RHEL cluster with a quorum disk.
>>> There are clustered LVs with the clustered (-c-) flag set.
>>> If I start clvmd, all the clustered LVs become active.
>>>
>>> After this, if I start rgmanager, it deactivates all the volumes and
>>> is unable to activate them again: the devices no longer exist during
>>> service startup, so the service fails. All LVs remain without the
>>> active flag.
>>>
>>> I can bring it up manually, but only if, after clvmd is started, I
>>> first deactivate the LVs by hand with 'lvchange -an <lv>'.
>>> After this, when I start rgmanager, it can bring the service online
>>> without problems. However, I think this action should be done by
>>> rgmanager itself. The logs are full of the following:
>>> rgmanager Making resilient: lvchange -an ....
>>> rgmanager lv_exec_resilient failed
>>> rgmanager lv_activate_resilient stop failed on ....
>>>
>>> Also, the lvs/clvmd commands themselves sometimes hang; I have to
>>> restart clvmd (sometimes kill it) to make them work again.
>>>
>>> Anyone has any idea, what to check?
>>>
>>> Thanks and regards,
>>> Krisztian
>>
>> Please paste your cluster.conf file with minimal edits.


-- 
Digimer
Papers and Projects: https://alteeve.com



