[Linux-cluster] Re: [RFC] New lvm_vg.sh agent for non-clustered lvm2 VGs

Simone Gotti simone.gotti at email.it
Wed Oct 3 22:20:18 UTC 2007


Hi,

As reported in the previous mail, I tried to implement a new VG-based
rgmanager agent for non-clustered lvm2 VGs.

It's a sort of experiment: I just tried to implement some ideas in it,
and I'd like to get some suggestions on them before going on working on
it (it's quite probable that I made some wrong assumptions or missed
some problems).

I did some tests with various scenarios, for example: all machines
losing the same PVs; only one machine losing PVs; removing an LV;
removing the whole VG; etc. But to be sure there are no problems I have
to do many more tests and also try to document them.

Below are the most important points.

Thanks!
Bye!

===================================================================================
What you can/cannot do with it:

*) Only one service can own a VG. (this mimics other clusters)
*) In the lv_name parameter you can define a space-separated list of
logical volumes to start/monitor (is this OCF/XML compliant?). If it's
empty, all the LVs are started/monitored.
On stop, all the volumes are stopped anyway (see the sketch just below).
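
As an illustration of what I have in mind for the start path, a minimal
sketch (this is not the code of the attached script; OCF_RESKEY_vg_name
is an assumed parameter name, while lv_name is exported by rgmanager as
OCF_RESKEY_lv_name):

# Activate only the LVs listed in the lv_name parameter, or the whole VG
# when lv_name is empty.
if [ -z "$OCF_RESKEY_lv_name" ]; then
	vgchange -ay "$OCF_RESKEY_vg_name"
else
	for lv in $OCF_RESKEY_lv_name; do
		lvchange -ay "$OCF_RESKEY_vg_name/$lv"
	done
fi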

=====
How it's implemented:

*) Take control of the whole VG instead of a single LV by tagging the VG
and not the LVs.
*) Optional (can be removed if not wanted): I tried to allow adding
additional tags to the VG/LVs. The tags used to "lock" a VG to a node
should be of the form CMAN_NODE_${nodename}, so the "volume_list" in
/etc/lvm/lvm.conf should be changed accordingly (see the sketch after
this list).
*) Check if, for various reasons (manual intervention, race condition),
the VG has multiple node tags on it.
*) Check that the LVs aren't "node" tagged, so that they cannot also be
activated on other nodes.
*) The service is defined as unique="1" as only one service should use
one VG.
*) Shortened the status intervals (also for debugging purposes; they
can be increased to better values).
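
To make the tagging scheme concrete, a hedged example (the node name
"node1" and the VG name "vg_data" are just placeholders):

# /etc/lvm/lvm.conf on node "node1": only VGs/LVs carrying the node's own
# tag may be activated locally.
#   volume_list = [ "@CMAN_NODE_node1" ]

# "Locking" the VG to node1, inspecting the lock and releasing it:
vgchange --addtag CMAN_NODE_node1 vg_data    # claim the VG for node1
vgs --noheadings -o vg_tags vg_data          # should print CMAN_NODE_node1
vgchange --deltag CMAN_NODE_node1 vg_data    # release it on stop/relocate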

=====
Bugs found in the current lvm.sh that I tried to fix:

Note: with mirrored devices, dmeventd runs "vgreduce --removemissing"
only when a write fails. On read failures dmeventd does nothing to the
VG, so programs like vgs/lvs will fail.

*) If one or more PVs are missing when the service is started, vgs/lvs
return an error and don't provide the tags. The script wrongly assumes
that the VG (previously the LV) isn't owned, tries to steal it, the
tagging fails, and so it runs "vgreduce --removemissing". But the VG
could be active on another machine.
Solution: first check for missing PVs, get the tags in partial mode
and, if the VG isn't owned, fix it by issuing "vgreduce --removemissing"
(see the first sketch after this list).

*) If one or more PVs are missing when the service is stopped, vgs
returns an error and doesn't provide the tags. The script wrongly
assumes that the VG (previously the LV) isn't owned, tries to steal it,
the tagging fails, and so the stop fails too, putting the service in a
failed state that needs manual intervention.
Solution: same as above: first check for missing PVs, get the tags in
partial mode and, if the VG isn't owned, fix it by issuing "vgreduce
--removemissing".

*) If one or more PVs are missing after the start, the status check
assumes that the node doesn't own the VG (as it cannot get the tags) and
tries to change the VG; it then returns an error that leads to a service
restart.
Solution: first check for missing PVs and, if any are missing, return an
error. Introduce a recover action that does the same as start; this will
fix the VG.

*) If one machine loses some disks, dmeventd or the start/stop scripts
will call "vgreduce --removemissing". Then, if the service is switched
to another machine that sees all the disks, or if the disks come back,
every lvm command that is launched issues the warning "Inconsistent
metadata found for VG %s - updating to use version %s".
Solution: avoid this by calling lvm_exec_resilient for all commands,
except for the few vgs calls made with --partial or that need to see the
real state (vgs doesn't report "Inconsistent metadata found for VG %s -
updating to use version %s" anyway, so no problem there). Since I also
needed some output echoed (returned) by lvm_exec_resilient, I redirected
all of its ocf_log output to stderr (see the second sketch after this
list).
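
A rough sketch of the missing-PV handling used by the start/stop paths
above (this is not the actual code of the attached agent: the function
name is made up, OCF_RESKEY_vg_name is an assumed parameter name, the
node tag is derived from hostname instead of the cman node name, and I
read "not owned" as "not tagged for another node"):

check_missing_pvs()
{
	local vg="$OCF_RESKEY_vg_name"
	local my_tag="CMAN_NODE_$(hostname)"   # real agent would use the cman node name
	local vg_attr vg_tags

	# The 4th character of vg_attr is "p" when one or more PVs are missing.
	vg_attr=$(vgs --partial --noheadings -o vg_attr "$vg" 2>/dev/null | tr -d ' ')
	case "$vg_attr" in
	???p*) ;;            # partial VG: go on and try to repair it
	*)     return 0 ;;   # nothing is missing, nothing to do
	esac

	# Plain vgs/lvs fail on a partial VG, so read the tags in partial mode.
	vg_tags=$(vgs --partial --noheadings -o vg_tags "$vg" 2>/dev/null | tr -d ' ')
	case "$vg_tags" in
	""|"$my_tag")
		# Not owned by another node: safe to drop the missing PVs.
		vgreduce --removemissing "$vg"
		;;
	*)
		# The VG may still be active on another machine; leave it alone.
		return 1
		;;
	esac
}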

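And a minimal sketch of the stderr redirection mentioned in the last
point (the real lvm_exec_resilient also retries the failed command; only
the logging/capture pattern is shown here):

lvm_exec_resilient()
{
	# Log to stderr so that callers can still capture the command's
	# stdout with $(...).
	ocf_log debug "Executing: $*" >&2
	"$@"
}

# The caller can now capture the real command output:
vg_tags=$(lvm_exec_resilient vgs --noheadings -o vg_tags "$OCF_RESKEY_vg_name")
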
=====
Things to do:

*) Maybe increase the start/stop/status timeouts (5 seconds looks too
short for a "vgreduce --removemissing" on big VGs); see the sketch
below.
*) Implement better validation of the parameters.
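
For the timeouts, what I have in mind is simply raising the values in
the <actions> block of the agent metadata; a sketch with guessed numbers
(the real action list and values in lvm_vg.sh may differ):

cat <<EOT
	<actions>
		<action name="start" timeout="60"/>
		<action name="stop" timeout="60"/>
		<action name="status" timeout="60" interval="10"/>
	</actions>
EOT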


On Wed, 2007-10-03 at 09:57 -0500, Jonathan Brassow wrote:
> Great stuff!  Much of what you are describing I've thought about in  
> the past, but just haven't had the cycles to work on.  You can see in  
> the script itself, the comments at the top mention the desire to  
> operate on the VG level.  You can also see a couple vg_* functions  
> that simply return error right now, but were intended to be filled in.

>   brassow
> 
-- 
Simone Gotti
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lvm_vg.sh.20061003-2330
Type: application/x-shellscript
Size: 21309 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20071004/eccbaa92/attachment.bin>

