[Linux-cluster] problems with clvmd and lvms on rhel6.1

Chip Burke CBurke at innova-partners.com
Fri Aug 10 17:00:20 UTC 2012


See my thread from earlier, as I am having similar issues. I am testing this
soon, but I "think" the issue in my case is setting up SCSI fencing before
GFS2. Essentially it has nothing to fence off of, sees that as a fault, and
never recovers. I "think" my fix will be to establish the LVMs, GFS2, etc.,
and then put in the SCSI fence so that it can actually create the persistent
reservations. Then the fun begins: pulling the plug randomly to see how it
behaves.
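
For reference, the kind of stanza I plan to add once the storage side is in
place looks roughly like this (untested sketch; the device path, fence device
name, and node names are just placeholders for my environment):

	<fencedevices>
		<fencedevice agent="fence_scsi" name="scsi_fence" devices="/dev/mapper/mpath_shared"/>
	</fencedevices>

and per node, something like:

	<clusternode name="nodeA" nodeid="1">
		<fence>
			<method name="scsi">
				<device name="scsi_fence"/>
			</method>
		</fence>
		<unfence>
			<device name="scsi_fence" action="on"/>
		</unfence>
	</clusternode>

so that the nodes register their keys at join time (unfence with action="on")
and the persistent reservations exist before anything needs to be fenced.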
________________________________________
Chip Burke

On 8/10/12 12:46 PM, "Digimer" <lists at alteeve.ca> wrote:

>Not sure if it relates, but I can say that without fencing, things will
>break in strange ways. The reason is that if anything triggers a fault,
>the cluster blocks by design and stays blocked until a fence call
>succeeds (which is impossible without fencing configured in the first
>place).
>
>Can you please set up fencing and test that it works (run 'fence_node
>rhel2.local' from rhel1.local, then the reverse)? Once that is done, test
>again for your problem. If it still exists, please paste the updated
>cluster.conf, and also include syslog from both nodes from around the
>time of your LVM tests.
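>
>For example, once a fence device is defined, the test itself is just the
>following (hostnames as in your cluster.conf; run each from the other
>node and confirm the target is actually fenced and then rejoins):
>
>  [root at rhel1 ~]# fence_node rhel2.local
>  [root at rhel2 ~]# fence_node rhel1.local
>
>Watching 'fence_tool ls' and /var/log/messages on the surviving node
>while this runs is also useful.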
>
>digimer
>
>On 08/10/2012 12:38 PM, Poós Krisztián wrote:
>> This is the cluster.conf, which is a clone of the problematic system in
>> a test environment (without the Oracle and SAP instances, focusing only
>> on this LVM issue, with an LVM resource):
>>
>> [root at rhel2 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="teszt">
>> 	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>> 	<clusternodes>
>> 		<clusternode name="rhel1.local" nodeid="1" votes="1">
>> 			<fence/>
>> 		</clusternode>
>> 		<clusternode name="rhel2.local" nodeid="2" votes="1">
>> 			<fence/>
>> 		</clusternode>
>> 	</clusternodes>
>> 	<cman expected_votes="3"/>
>> 	<fencedevices/>
>> 	<rm>
>> 		<failoverdomains>
>> 			<failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
>> 				<failoverdomainnode name="rhel1.local" priority="1"/>
>> 				<failoverdomainnode name="rhel2.local" priority="2"/>
>> 			</failoverdomain>
>> 		</failoverdomains>
>> 		<resources>
>> 			<lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
>> 			<fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4"
>> mountpoint="/lvm" name="teszt-fs"/>
>> 		</resources>
>> 		<service autostart="1" domain="all" exclusive="0" name="teszt"
>> recovery="disable">
>> 			<lvm ref="teszt-lv"/>
>> 			<fs ref="teszt-fs"/>
>> 		</service>
>> 	</rm>
>> 	<quorumd label="qdisk"/>
>> </cluster>
>>
>> Here are the log parts:
>> Aug 10 17:21:21 rgmanager I am node #2
>> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
>> Aug 10 17:21:22 rgmanager Loading Service Data
>> Aug 10 17:21:29 rgmanager Initializing Services
>> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
>> Aug 10 17:21:31 rgmanager Services Initialized
>> Aug 10 17:21:31 rgmanager State change: Local UP
>> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
>> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
>> Aug 10 17:23:25 rgmanager Failed to activate logical volume,
>>teszt/teszt-lv
>> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager Failed second attempt to activate
>>teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic
>>error)
>> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return
>> value: 1
>> Aug 10 17:23:29 rgmanager Stopping service service:teszt
>> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with
>> a real device
>> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid
>> argument(s))
>> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop;
>> intervention required
>> Aug 10 17:23:31 rgmanager Service service:teszt is failed
>> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not
>> start.
>> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop
>>cleanly
>> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
>> Aug 10 17:25:14 rgmanager Failed to activate logical volume,
>>teszt/teszt-lv
>> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:25:17 rgmanager Failed second attempt to activate
>>teszt/teszt-lv
>> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic
>>error)
>> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return
>> value: 1
>> Aug 10 17:25:18 rgmanager Stopping service service:teszt
>> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with
>> a real device
>> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid
>> argument(s))
>>
>>
>> After I manually started the LV on node1 and then tried to switch the
>> service over to node2, it was not able to start it there.
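>>
>> (The manual sequence is roughly the following; just a sketch, with the
>> service/LV names from the test config above:
>>
>>   [root at rhel1 ~]# lvchange -ay teszt/teszt-lv
>>   [root at rhel1 ~]# clusvcadm -e teszt -m rhel1.local
>>   [root at rhel1 ~]# clusvcadm -r teszt -m rhel2.local
>>
>> and it is the relocation to rhel2.local that fails.)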
>>
>> Regards,
>> Krisztian
>>
>>
>> On 08/10/2012 05:15 PM, Digimer wrote:
>>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>>> Dear all,
>>>>
>>>> I hope someone has run into this problem in the past and can help me
>>>> resolve this issue.
>>>>
>>>> There is a 2-node RHEL cluster, with a quorum disk as well.
>>>> There are clustered LVs, with the clustered (-c-) flag set.
>>>> When I start clvmd, all the clustered LVs come online.
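>>>>
>>>> (For completeness, the flags can be checked with something like:
>>>>
>>>>   vgs -o vg_name,vg_attr     # 'c' in the attr column = clustered VG
>>>>   lvs -o lv_name,lv_attr     # 'a' in the attr column = active LV
>>>>
>>>> in case that is useful for reproducing it.)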
>>>>
>>>> After this, if I start rgmanager, it deactivates all the volumes and is
>>>> then unable to activate them again, because the devices no longer exist
>>>> by the time the service starts up, so the service fails.
>>>> All LVs remain without the active flag.
>>>>
>>>> I can bring it up manually, but only if, after clvmd has started, I
>>>> first deactivate the LVs by hand with lvchange -an <lv> (roughly the
>>>> sequence sketched after the log lines below). After that, when I start
>>>> rgmanager, it can bring the service online without problems. However,
>>>> I think this step should be done by rgmanager itself. The logs are full
>>>> of the following:
>>>> rgmanager Making resilient: lvchange -an ....
>>>> rgmanager lv_exec_resilient failed
>>>> rgmanager lv_activate_resilient stop failed on ....
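>>>>
>>>> (The workaround sequence I mean above is roughly:
>>>>
>>>>   service clvmd start        # all clustered LVs come up active
>>>>   lvchange -an <vg>/<lv>     # deactivate each clustered LV by hand
>>>>   service rgmanager start    # the service then starts cleanly
>>>>
>>>> and only then does rgmanager manage to activate the LV itself.)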
>>>>
>>>> In addition, the lvs/clvmd commands themselves sometimes hang, and I
>>>> have to restart clvmd (sometimes kill it) to make them work again.
>>>>
>>>> Does anyone have any idea what to check?
>>>>
>>>> Thanks and regards,
>>>> Krisztian
>>>
>>> Please paste your cluster.conf file with minimal edits.
>
>
>-- 
>Digimer
>Papers and Projects: https://alteeve.com
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster