[Linux-cluster] Storage Cluster Newbie Questions - any help with answers greatly appreciated!

Michael @ Professional Edge LLC m3 at professionaledgellc.com
Mon Mar 15 23:23:13 UTC 2010


Hello again Andreas,

Interesting mix... assuming I follow your logic, you are suggesting I 
combine the "single non-clustered LVM mirror" with the "HA-LVM tag-based 
fail-over".

HA-LVM is where I started on this concept... so I have already set up the 
basic /etc/lvm/lvm.conf - volume_list = [ "VolGroup00", "@`hostname`" ] - 
type stuff, for the root mount only.

When I looked at LVM mirroring in general, the concept of the dedicated 
disk for mirror logging and/or the use of memory made me pause... but... 
you are saying do something like this: on node#1, take "shelf1: disk1" 
and LVM-mirror it to "shelf2: disk1" - and put the mirror logging in 
memory?  Therefore in a fail-over condition I obviously lose the in-memory 
mirror logging - but node#2 would just see the LVM mirror as being in an 
inconsistent state - and resync the mirror?
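
I.e. something roughly along these lines (just guessing at the syntax here; 
the multipath device names, VG name and size are only placeholders)?

    # mirror shelf1:disk1 onto shelf2:disk1, keeping the mirror log in memory
    lvcreate -m 1 --mirrorlog core -L 500G -n lv_data datavg \
        /dev/mapper/shelf1_disk1 /dev/mapper/shelf2_disk1
    # tag it with this node's hostname so volume_list allows activation here
    lvchange --addtag `hostname` datavg/lv_data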

-Michael


Andreas Pfaffeneder wrote, On 3/15/2010 7:52 AM:
> Hi Michael,
>
> A way to prevent both systems from thinking that they're responsible for the
> FC-devices is to use LVM to build a host-based mirror, plus LVM filters
> and tags:
>
> - Set up your RHEL-Cluster
> - Modify /etc/lvm/lvm.conf: volume_list =
> ["local-vg-if-used-to-store-/","@local_hostname-as-in-cluster.conf"] (all
> systems of the cluster)
> - rebuild initrd, reboot
> -->  LVM only accesses local LVs and LVs with the local hostname in the
> lvm-tag
> - create a mirrored LV with lvcreate --addtag or lvchange --addtag so the
> LV will only be active on one system
>
> RH-Cluster supports floating LVs on its own, just add the LV + FS as
> resources.
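>
> Roughly like this in cluster.conf (the names are only examples):
>
>   <service name="nfsdata" autostart="1">
>     <lvm name="halvm1" vg_name="datavg" lv_name="lv_data"/>
>     <fs name="datafs" device="/dev/datavg/lv_data"
>         mountpoint="/export/data" fstype="ext3" force_unmount="1"/>
>   </service>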
>
> If you're not using a LUN for mirror-logging, you're moving the LVs from
> one system to the other at the cost of one full rebuild of the host-based
> mirror.
>
> Regards
> Andreas
>
> On Sun, 14 Mar 2010 22:28:20 -0700, "Michael @ Professional Edge LLC"
> <m3 at professionaledgellc.com>  wrote:
>    
>> Kaloyan,
>>
>> I agree - disabling the qla2xxx driver (Qlogic HBA) from starting at
>> boot would be the simple method of handling the issue.  Then I just put
>> all the commands to load the driver, multipath, mdadm, etc... inside
>> cluster scripts.
>>
>> Amusingly it seems I am missing something very basic - as I can't seem
>> to figure out how to not load the qla2xxx driver.
>>
>> Do you happen to know the syntax to make the qla2xxx driver not load at
>> boot automatically?
>>
>> I've been messing with /etc/modprobe.conf - and mkinitrd - but no
>> combination has resulted in the qla2xxx being properly disabled during
>> boot - I did accomplish making one of my nodes unable to mount its root
>> partition - but I don't consider that success. :-)
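>>
>> The kind of combination I mean (whether or not it is the right one) is
>> along these lines:
>>
>>     # /etc/modprobe.conf - comment out the HBA alias so mkinitrd skips it
>>     #alias scsi_hostadapter1 qla2xxx
>>     blacklist qla2xxx
>>
>>     # then rebuild the initrd (root is on the local sda) and reboot
>>     mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)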
>>
>>
>> As for your 2nd idea: I have seen folks doing something similar in that
>> mode, when the disks are local to the node.  But in my case all nodes
>> can already see all LUNs - so I don't really have any need to do an
>> iSCSI export - appreciate the thought though.
>>
>> -Michael
>>
>>
>> Kaloyan Kovachev wrote, On 3/4/2010 10:28 AM:
>>      
>>> On Thu, 04 Mar 2010 09:26:35 -0800, Michael @ Professional Edge LLC wrote
>>>        
>>>> Hello Kaloyan,
>>>>
>>>> Thank you for the thoughts.
>>>>
>>>> You are correct when I said - "Active / Passive" - I simply meant that I
>>>> had no need for "Active / Active" - and floating IP on the NFS share
>>>> would be exactly what I had in mind.
>>>>
>>>> The software raid - of any type, raid1,5,6 etc... - is the issue.  From
>>>> what I have read, mdadm is not cluster-aware... and... since all
>>>> disks are seen by all RHEL nodes - as Leo mentioned, some method to
>>>> disable the kernel from finding, detecting, and attempting to assemble all
>>>> the available software raids - is a major problem.  This is why I was
>>>> asking if perhaps CLVM w/mirroring would be a better method.  Although
>>>> since it was just introduced in RHEL 5.3, I am a bit leery.
>>>>
>>>>
>>>>          
>>> I am not familiar with FC, so maybe completely wrong here, but if you do not
>>> start multipath and load your HBA drivers on boot, how will the FC-disk-based
>>> software raid start at all?
>>>
>>> Even if it is started, you may still issue 'mdadm --stop /dev/mdX' in S00 as
>>> suggested by Leo and assemble it again as a cluster service later.
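>>>
>>> Something as simple as this wrapper, used as a <script> resource, should
>>> do (the device names are placeholders):
>>>
>>>     #!/bin/sh
>>>     # assemble/stop the shared md device from the cluster, not at boot
>>>     case "$1" in
>>>       start)  mdadm --assemble /dev/md0 \
>>>                   /dev/mapper/shelf1_disk1 /dev/mapper/shelf2_disk1 ;;
>>>       stop)   mdadm --stop /dev/md0 ;;
>>>       status) mdadm --detail /dev/md0 > /dev/null 2>&1 ;;
>>>     esac
>>>     exit $?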
>>>
>>>
>>>        
>>>> Sorry for being confusing - yes, the linux machines will have a
>>>> completely different filesystem share than the windows machines.  My
>>>> original thought was I would do "node#1 primary nfs share (floating
>>>> ip#1) to linux machines w/node#2 backup" - and then "node#2 primary nfs
>>>> or samba share (floating ip#2) to windows machines w/node#1 backup".
>>>>
>>>> Any more thoughts you have would be appreciated... as my original plan
>>>> with MDADM w/HA-LVM - so far doesn't seem very possible.
>>>>
>>>>
>>>>          
>>> Then there are two services, each with its own raid array and IP, but
>>> basically the same.
>>>
>>> Another idea ... I am not using it in production, but I had good results
>>> (testing) with a (small) software raid5 array from 3 nodes ... a local
>>> device on each node exported via iSCSI, and software RAID5 over the
>>> imported ones, which is then used by LVM.  Weird, but it worked, and the
>>> only problem was that on every reboot of any node the raid is rebuilt -
>>> which won't happen in your case, as you will see all the disks in sync
>>> (after the initial sync done on only one of them) ... you may give it a try
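>>>
>>> Roughly, i.e. something like (names are placeholders; each node first
>>> exports its local device as an iSCSI target):
>>>
>>>     iscsiadm -m discovery -t sendtargets -p node1
>>>     iscsiadm -m discovery -t sendtargets -p node2
>>>     iscsiadm -m node --login
>>>     mdadm --create /dev/md0 --level=5 --raid-devices=3 \
>>>         /dev/sdb /dev/sdc /dev/sdd
>>>     pvcreate /dev/md0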
>>>
>>>
>>>        
>>>> -Michael
>>>>
>>>> Kaloyan Kovachev wrote, On 3/4/2010 8:52 AM:
>>>>
>>>>          
>>>>> Hi,
>>>>>
>>>>> On Wed, 03 Mar 2010 11:16:07 -0800, Michael @ Professional Edge LLC wrote
>>>>>
>>>>>            
>>>>>> Hail Linux Cluster gurus,
>>>>>>
>>>>>> I have researched myself into a corner and am looking for advice.  I've
>>>>>> never been a "clustered storage guy", so I apologize for the potentially
>>>>>> naive set of questions.  (I am savvy on most other aspects of networks,
>>>>>> hardware, OS's etc... but not storage systems.)
>>>>>>
>>>>>> I've been handed ( 2 ) x86-64 boxes w/2 local disks each, and ( 2 )
>>>>>> FC-AL disk shelves w/14 disks each, and told to make a mini NAS/SAN (NFS
>>>>>> required, GFS optional).  If I can get this working reliably then there
>>>>>> appear to be about another ( 10 ) FC-AL shelves and a couple of Fiber
>>>>>> Switches lying around that will be handed to me.
>>>>>>
>>>>>> NFS filesystems will be mounted by several (less than 6) linux machines,
>>>>>> and a few (less than 4) windows machines [[ microsoft nfs client ]] -
>>>>>> all more or less doing web server type activities (so lots of reads from
>>>>>> a shared filesystem - log files not on NFS, so no issue with high IO
>>>>>> writes).  I'm locked into NFS v3 for various reasons.  Optionally the
>>>>>> linux machines can be clustered and GFS'd instead - but I would still
>>>>>> need to come up with a solution for the windows machines - so a NAS
>>>>>> solution is still required even if I do GFS to the linux boxes.
>>>>>>
>>>>>> Active / Passive on the NFS is fine.
>>>>>>
>>>>>>
>>>>>>              
>>>>> Why not start NFS/Samba on both machines with only the IP floating between
>>>>> them, then?
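>>>>>
>>>>> (i.e. just a floating address as the cluster service, something like:)
>>>>>
>>>>>     <service name="nfs-ip" autostart="1">
>>>>>       <ip address="192.168.1.10" monitor_link="1"/>
>>>>>     </service>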
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> * Each of the ( 2 ) x86-64 machines has a Qlogic dual HBA, 1 fiber
>>>>>> direct-connected to each shelf (no fiber switches yet - but will have
>>>>>> them later if I can make this all work); I've loaded RHEL 5.4 x86-64.
>>>>>>
>>>>>> * Each of the ( 2 ) RHEL 5.4 boxes used the 2 local disks w/onboard
>>>>>> fake raid1 = /dev/sda - basic install, so /boot and LVM for the rest -
>>>>>> nothing special here (didn't do mdadm, basically for the simplicity of
>>>>>> /dev/sda).
>>>>>> * Each of the ( 2 ) RHEL 5.4 boxes can see all the disks on both shelves
>>>>>> - and since I don't have Fiber Switches yet, at the moment there is
>>>>>> only 1 path to each disk; however, as I assume I will figure out a method
>>>>>> to make this work, I have enabled multipath - and therefore I have
>>>>>> consistent names for all 28 disks.
>>>>>>
>>>>>> Here's my dilemma.  How do I best add Redundancy to the Disks, removing
>>>>>> as many single points of failure, and preserving as much diskspace as
>>>>>> possible?
>>>>>>
>>>>>> My initial thought was to take "shelf1:disk1 and shelf2:disk1" and put
>>>>>> them into a software raid1 (mdadm), then put the resulting /dev/md0 into
>>>>>> an LVM.  When I need more diskspace, I then just create "shelf1:disk2 and
>>>>>> shelf2:disk2" as another software raid1 and just add the new "/dev/md1"
>>>>>> into the LVM and expand the FS.  This handles a couple of things in my mind:
>>      
>>>>>> 1. Each shelf is really a FC-AL, so it's possible that a single disk
>>>>>> going nuts could flood the FC-AL and all the disks in that shelf go poof
>>>>>> until the controller can figure itself out and/or the bad disk is removed.
>>>>>> 2. Efficient - I am retaining 50% storage capacity after redundancy if I
>>>>>> can do the "shelf1:disk1 + shelf2:disk1" mirrors; plus all bandwidth
>>>>>> used is spread across the 2 HBA fibers and nothing goes over the TCP
>>>>>> network.  Conversely, DRBD doesn't excite me much - as I then have to do
>>>>>> both raid in the shelf (probably still with MDADM) and then add TCP
>>>>>> (ethernet) based RAID1 between the nodes - and when all is said and done
>>>>>> I only have 25% of storage capacity still available after redundancy.
>>>>>> 3. It's easy to add more diskspace - as each new mirror (software raid1)
>>>>>> can just be added to an existing LVM.
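>>>>>>
>>>>>> Something like this (device names are placeholders for the multipath
>>>>>> aliases, VG/LV names just illustrative):
>>>>>>
>>>>>>     mdadm --create /dev/md0 --level=1 --raid-devices=2 \
>>>>>>         /dev/mapper/shelf1_disk1 /dev/mapper/shelf2_disk1
>>>>>>     pvcreate /dev/md0
>>>>>>     vgcreate nasvg /dev/md0
>>>>>>     lvcreate -n lv_export -l 100%FREE nasvg
>>>>>>     # later, to grow:
>>>>>>     mdadm --create /dev/md1 --level=1 --raid-devices=2 \
>>>>>>         /dev/mapper/shelf1_disk2 /dev/mapper/shelf2_disk2
>>>>>>     vgextend nasvg /dev/md1
>>>>>>     lvextend -l +100%FREE nasvg/lv_export
>>>>>>     resize2fs /dev/nasvg/lv_export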
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> You may create RAID1 (between the two shelves) over RAID6 (on the disks from
>>>>> the same shelf), so you will lose only 2 more disks per shelf, or about 40%
>>>>> storage space left, but more stable and faster.  Or several RAID6 arrays with
>>>>> 2+2 disks from each shelf - again 50% storage space, but better performance
>>>>> with the same chance of data loss as with several RAID1 ... the resulting
>>>>> mdX you may add to LVM and use the logical volumes.
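>>>>>
>>>>> For the RAID1-over-RAID6 variant, roughly (device names are placeholders,
>>>>> and the {1..14} expansion assumes bash):
>>>>>
>>>>>     mdadm --create /dev/md10 --level=6 --raid-devices=14 /dev/mapper/shelf1_d{1..14}
>>>>>     mdadm --create /dev/md11 --level=6 --raid-devices=14 /dev/mapper/shelf2_d{1..14}
>>>>>     mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/md10 /dev/md11
>>>>>     pvcreate /dev/md12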
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>>        From what I can find messing with Luci (Conga), though, I don't
>>>>>> see any resource scripts listed for "mdadm" (on RHEL 5.4) - so would
>>>>>> my idea even work (I have found some posts asking for a mdadm resource
>>>>>> script but I've seen no response)?  I also see that with RHEL 5.3 LVM has
>>>>>> mirrors that can be clustered now - is this the right answer?  I've done
>>>>>> a ton of reading, but everything I've dug up so far assumes that the
>>>>>> fiber devices are being presented by a SAN that is doing the redundancy
>>>>>> before the RHEL box sees the disk... or... there are a ton of examples
>>>>>> where fiber is not in the picture and there are a bunch of locally
>>>>>> attached hosts presenting storage onto the TCP (ethernet) network - but
>>>>>> I've found hardly anything on my situation...
>>>>>>
>>>>>> So... here I am... :-)  I really just have 2 nodes - which can both see
>>>>>> a bunch of disks (JBOD) - and I want to present them to multiple hosts via
>>>>>> NFS (required) or GFS (to linux boxes only).
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> If the Windows and Linux data are different volumes, it is better to leave the
>>>>> GFS partition(s) available only via iSCSI to the linux nodes participating in
>>>>> the cluster, and not to mount it/them locally for the NFS/Samba shares; but if
>>>>> the data should be the same, you may even go Active/Active with GFS over iSCSI
>>>>> [over CLVM and/or] [over DRBD] over RAID, and use NFS/Samba over GFS as a
>>>>> service in the cluster.  It all depends on how the data will be used from the
>>>>> storage.
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> All ideas - are greatly appreciated!
>>>>>>
>>>>>> -Michael
>>>>>>
>>>>>> --
>>>>>> Linux-cluster mailing list
>>>>>> Linux-cluster at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>>
>>>>>>
>>>>>>              
>>>>>            
>>>        
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>      
>    



