README mdmonitor
================

Warranty:
=========
This software is provided as is and comes with no warranty. You can use it,
modify it and improve it (inform its author in this case, he will be glad to
benefit from your improvements). It must not be used on production systems
unless you know what you are doing and have completely analyzed the code
provided here. The author cannot be held responsible for any problem caused
by this software. The author does not expect any Linux distribution provider
to support this software unless they find it useful and acceptable.
The author is reachable at: brem dot belguebli at gmail dot com.

I) Requirements:
================
- sg3_utils
- rsync
- The ability to remote shell and copy files between all the nodes of the
  cluster (done through ssh authorized_keys, for example). This is
  absolutely necessary!
- All the nodes of a cluster must share the same device names.
- The devices used to assemble the MD arrays DO NOT need to be flagged
  (fdisk) "fd Linux raid auto"; it is actually better to avoid doing so, to
  prevent the startup scripts from automatically activating your MD devices.

II) Description:
================
Mdmonitor has been written to allow the use of RAID1 mirrored arrays (with
mdadm) under the control of rgmanager. The initial need was to be able to
create clustered resources on top of shared storage located in different
datacenters, using RAID1 to address disaster recovery concerns.

As LVM mirroring is not (yet) a fully cluster-supported solution and has
some limitations (RHEL 5.3 is not yet able to live-extend a mirrored lvol,
the mirror log device is a single point of failure, etc.), mdadm, one of the
proven RAID1 solutions under Linux, was chosen.

III) Limitations:
=================
As RHCS does not provide any intra-cluster file copy mechanism, a
passwordless ssh based mechanism is needed to update all the nodes of the
cluster at startup and during operation. This is absolutely necessary! It
could lead to performance problems in very large clusters.

Only md RAID1 devices are supported.

Mdmonitor was written to support a maximum of 3 legs (md components) per
RAID1 device. This is due to the fact that bash does not allow the use of
multidimensional arrays (${ARRAY[$i][$j]}). A 3-leg mirror can be useful
during a storage migration; mdmonitor allows one to still benefit from
rgmanager-managed mirrored devices in this situation.

Mdmonitor was written "artisanally". The author not being a professional
coder, any volunteer to clean up and improve the code is welcome.

As md is not cluster aware, anyone with root privileges could manually
activate an already active RAID device on another node. This is no worse
than using hosttags (aka HA-LVM), as anyone could delete the hosttags on
another node and activate a VG that is already active somewhere else.
Mdmonitor checks among all the member nodes of the cluster whether a given
MD device is already active. If it is, it exits in error, causing the
resource to which it belongs to fail. This is achieved by launching remote
commands (through ssh) between all the cluster nodes. The author recommends
using lvm-cluster.sh on top of mdmonitor to provide better-secured devices
(lvm-cluster.sh from Rafael Mico Miranda,
https://www.redhat.com/archives/cluster-devel/2009-June/msg00065.html).

Mdmonitor is written to be used in an active/passive (aka failover) fashion.
It is not intended to be used in active/active setups (with shared LVM and
GFS on top).
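To illustrate the cross-node activity check described above, here is a
minimal sketch of how such a test can be done over ssh. The node names are
placeholders and this is not the actual mdmonitor code, which may obtain the
member list and device names differently:

    #!/bin/bash
    # Sketch only: refuse to start md0 locally if it is already assembled
    # on another node. NODES is a placeholder list of cluster members.
    MD=md0
    NODES="node1 node2"
    for node in $NODES; do
        [ "$node" = "$(hostname -s)" ] && continue
        if ssh "$node" "grep -q '^${MD} :' /proc/mdstat"; then
            echo "$MD is already active on $node, refusing to start" >&2
            exit 1
        fi
    done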
IV) Detailed functionality:
===========================
Mdmonitor is built to be used as a service-dedicated resource, not a global
one. It relies on a raid configuration file (mandatory attribute
OCF_RESKEY_raidconf) that has to be located under
/etc/cluster/{myservicename}. The OCF_RESKEY_raidconf file has to be called
{myservicename}-raid.conf.
Note: this is due to the fact that the upper service name could not be
inherited at the lower resource level.

The format of the OCF_RESKEY_raidconf file is as follows:

    RAID_DEVICE[0]=/dev/md0
    RAID_LEG_0[0]=/dev/mpath/mpath0
    RAID_LEG_1[0]=/dev/mpath/mpath1
    #                                 <-- this separator is not mandatory, just to make things look clearer
    RAID_DEVICE[1]=/dev/md1
    RAID_LEG_0[1]=/dev/mpath/mpath2
    RAID_LEG_1[1]=/dev/mpath/mpath3
    RAID_LEG_2[1]=/dev/mpath/mpath4   <-- 3rd leg if needed

The OCF_RESKEY_raidconf file must be copied to all the nodes of the cluster.

The other attribute, which is optional, is the policy (OCF_RESKEY_policy),
which defaults to quorum if undefined. The accepted values are quorum and
strict. In quorum mode, mdmonitor allows the partial assembly of md devices
in case of a partial loss of shared storage (a 2-leg mirror can start with 1
available leg, while a 3-leg mirror can be started with a minimum of 2
legs). In strict mode, mdmonitor fails if at least one array is missing one
of its legs at start time, thus causing the service to fail. This may suit
paranoid setups.

1) Start:
---------
When starting a service built with mdmonitor, mdmonitor first checks whether
the raid devices (/dev/mdX) are already active locally. If so, it exits 1,
thus failing the service. Otherwise, it checks whether the devices are
active on the other active nodes. If so, it exits 1, thus failing the
service. Otherwise, it starts assembling the devices as declared in the
OCF_RESKEY_raidconf file.

If a lock exists for a given device (cf. Managing Raid Failures below),
mdmonitor checks each leg to see whether it was faulty. If so, and depending
on its current status, the leg may be ignored to prevent potential data
corruption. If no lock exists, the raid device is assembled with all of its
legs (provided all of them have the status "working").

When all the raid devices are assembled (even partially, if the quorum
policy allows it), all the nodes are updated with as many files as there are
raid devices, each file named after the short name mdX (md0 for /dev/md0,
etc.) and containing the device name and the leg(s) it was assembled with.
Mdmonitor then exits 0.

2) Stop:
--------
All the devices must be unused (no active VG/LV or mounted FS) for mdmonitor
to stop successfully.

3) Monitor:
-----------
By default, rgmanager invokes mdmonitor status (the OpenCF standard states
that the monitor action is mandatory, while rgmanager seems to require
status) every 10 minutes. It checks all the raid devices declared in
OCF_RESKEY_raidconf. If one leg of a given device is missing, mdmonitor
exits with status 33, which triggers the repair function (cf. Managing Raid
Failures below).

V) Managing Raid Failures:
==========================
When failures occur on raid devices, mdmonitor (when run with status by
rgmanager) will notice them and call the repair function (by exiting 33).
Repair checks the missing legs of the corresponding raid devices and
reassembles them if they are seen as "working". If the missing legs of the
corresponding devices are not seen, this is logged.
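In terms of plain mdadm commands, the repair step described above boils down
to operations of the following kind (an illustration using the example
device names from the configuration file shown earlier, not the exact
commands issued by mdmonitor):

    # Illustration only: put a previously failed leg back into a degraded
    # 2-leg RAID1 mirror so that md resynchronizes it.
    mdadm --manage /dev/md0 --remove /dev/mpath/mpath1   # clear the faulty slot if it is still listed
    mdadm --manage /dev/md0 --add /dev/mpath/mpath1      # re-add the leg, triggering a resync
    mdadm --detail /dev/md0                              # check that the array goes back to a clean state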
The repair function also creates a lock file ({myservicename}.mdlock) in the
following format:

    degraded:/dev/mdX:{mymissingleg}:{local_node_name}
    repairing:/dev/mdY:{myreassembledleg}:{local_node_name}
    ...

This file is copied to all member nodes as
/var/lock/md/{myservicename}.mdlock. At the next pass of monitor, if all the
raid devices are back to a consistent (clean) state, a cleanup order is sent
to all member cluster nodes to remove /var/lock/md/{myservicename}.mdlock.
If not, the lock file is updated with the new status on all nodes.

VI) Managing failover with partial/unclean raid devices
========================================================
If a service fails over to another cluster node while some devices are
unclean, the new node checks the state of each raid device. Mdmonitor exits
with an error status if, for a given raid device, the only available leg is
the one that was marked faulty in the {myservicename}.mdlock file.

VII) Troubleshooting and debugging
==================================
Mdmonitor is a bash script; one should know how to debug it. Some logging
options have been added:

1) Measuring the time spent in each function, by setting the variable PERF
   to 1.
2) Logging the md devices status (if clean), by setting LOG=1.

VIII) Things to do
==================
- Cover the case of unavailable nodes: when they join back, they need to be
  updated with all the start files (/etc/cluster/{myservicename}/md*) and
  lock files, if any, in case a failover happens onto a newly available
  node. For this, add an action run every 30 seconds that compares the
  number of member nodes available at its last run with the number currently
  available; if there is a difference, resync all the nodes (md state files
  and the mdlock file if it exists).
- Some cases may remain uncovered.
- Some cases may be handled in a non-optimal way.
- Some more scalability testing with dozens of MD arrays and a few more
  nodes.

Appendix
========
The main principles were inspired by a known clustering solution from a
well-known manufacturer that recently discontinued its Linux clustering
product.

The setup with which mdmonitor was written is the following:
- A 2-node cluster spread over 2 different sites, running RHEL 5.3 x86_64.
- 2 HP StorageWorks EVA8100 arrays, one on each site.
- DM-multipath configured and synced across the nodes (same dm-mp bindings
  file).
- Mirroring configured between LUNs of the 2 arrays (5 mirrored devices).
- The lvm-cluster resource script from Rafael Mico Miranda, with a few LVs
  on top of each MD device (a vgscan was added at activation), using LVM
  exclusive activation.
- Ext3 filesystems on top of each LV.
- An iSCSI qdisk on a third site to act as a tie-breaker in case of a
  network partition.

A brief output of my cluster.conf:
----------------------------------- 8< --------------------------------
---------------------------------- >8 --------------------------------------
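As a complement to section VII, a manual debugging run of the agent might
look like the following. The script path, the service name and the fact that
PERF/LOG can be passed through the environment are assumptions; adapt them
to your installation:

    # Hypothetical debugging invocation; adjust the path and service name
    # to your setup. Depending on how the script reads PERF and LOG, you
    # may have to set them inside the script rather than in the environment.
    export OCF_RESKEY_raidconf=/etc/cluster/myservice/myservice-raid.conf
    export OCF_RESKEY_policy=quorum
    PERF=1 LOG=1 bash -x /usr/share/cluster/mdmonitor.sh status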