[Linux-cluster] Starter Cluster / GFS

Fri Nov 19 03:16:13 UTC 2010

On 11/18/2010 09:23 PM, Nicolas Ross wrote:
> Hi again !
>
> I am beginning to play with my new servers. To start, I have 2 nodes (1U
> Intel server platform, with an LSI Logic FC949ES FC card). I am like a child
> playing with his new toy at Christmas...
>
> So, now I have a few points and questions. Sorry if it's long.
>
> 1. Raid sets
>
> So, I made up a 2-node cluster for the moment. I was able to bring up
> the cluster and make a GFS file system, in fact 2. We've run some tests
> with different RAID strategies. Our first idea for the GFS was to use 5
> 1 TB disks in RAID 5. With that I got a 4 TB fs. It has been suggested
> previously that this might not be a good idea. Our controller doesn't
> directly support RAID 10, which seems to be the consensus for a better
> setup. We will be making the 0 part on Linux.
>
> I made 2 RAID 1 sets of 1 TB (2 disks each) on our RAID enclosure, and
> added them to a single VG. I created an LV on top of that, so I end up
> with a 2 TB fs. We don't plan on using striping on the LV (-i2) because
> of the overhead: to add more space we would need to add 2 RAID 1 sets at
> a time. So we plan on making a "starter" GFS with those 2 sets (2 TB
> total). It's nearly double the 1.1 TB we have now, so we'll start with
> that.
>
> Now, we ran some write tests with dd, and judging by the disk activity,
> all data was written to the first disk (pair) of the VG, and never to the
> second one. I assume that once the first disk is full, it'll start
> writing to the 2nd one. In the long term, I don't believe it'll be a
> problem, but I'd prefer the data to be written alternately to both
> disks without using stripes. Is that possible? I looked at the --alloc
> option of vgcreate, but it doesn't seem to be that.

RAID 0 means that data should be evenly split across the array
members. I would suspect a problem there. If the RAID controller is a
simple one (I am not familiar with the model), I might suggest building
the entire array in Linux. It would be interesting to see the
performance difference, if any. Though I prefer that mainly because I am
more familiar with mdadm, so take that recommendation with a grain of salt.
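
To illustrate the difference between a linear (concatenated) volume and a
striped one, here is a rough Python sketch. It is only a toy model, not
LVM or mdadm code: the function names, the two 4-chunk devices and the
idea of a "chunk" as the unit are all made up for illustration. As far as
I know, an LV created without -i is linear, so it fills the first PV
before touching the second, which would match what dd showed.

```python
def linear_map(chunk, device_sizes):
    """Concatenation: fill device 0 completely, then device 1, and so on."""
    for device, size in enumerate(device_sizes):
        if chunk < size:
            return device, chunk
        chunk -= size
    raise ValueError("chunk is beyond the end of the volume")


def striped_map(chunk, num_devices):
    """Striping (RAID 0 style): consecutive chunks rotate across devices."""
    return chunk % num_devices, chunk // num_devices


# Two hypothetical devices of 4 chunks each (assumed sizes).
device_sizes = [4, 4]
linear = [linear_map(c, device_sizes) for c in range(8)]
striped = [striped_map(c, len(device_sizes)) for c in range(8)]

for c in range(8):
    print(f"chunk {c}: linear -> {linear[c]}, striped -> {striped[c]}")
```

In the linear case the first half of the writes all land on device 0; in
the striped case they alternate from the very first chunk.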

> 2. Network setup.
>
> All our new servers have 3 NICs, one being dedicated to the
> management module. I will be using the first one to make a private
> network that will be serving my services. In my new setup, real routable
> IPs will terminate at the router and will be NATed to the private ones
> for eventual load-balancing. I will be using the second network on a
> different VLAN and subnet for cluster communications. The management
> modules will be on that same VLAN. So is this a good setup? Should I be
> doing something differently?

Sounds okay to me. I like having a dedicated subnet for data syncing, 
but what really matters is that cluster communication and fencing are 
separate from Internet traffic.

> 3. Deadlocks
>
> I found a small C program for testing the locks/s that are possible on a
> file accessed simultaneously on many nodes. (It's ping_pong, some of you
> might have used it). So, one of the parameters of that program is the
> number of nodes using that file +1. On one test, I used 2 instead of 3
> on one of the nodes. The program on both nodes seemed stuck, not
> killable, not even with -9. So I must assume that they were in some kind
> of deadlock. dlm_tool deadlock_check didn't show anything, and I can't
> make heads or tails of gfs2_tool lockdump or what to do with it. I was
> forced to forcibly reboot one of the nodes. Most likely we won't run into
> that situation in our production environment. But I want to know what
> happened and what to do to prevent it or clear that kind of lock.

Do you have your fence devices configured and working properly? A 
failure to fence can hang a cluster. Also, are you using managed 
switches and have either IGMP snooping or spanning tree enabled?
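
For anyone not familiar with that kind of test, here is a rough Python
sketch of the locking pattern as I understand it: each process walks a
ring of byte-range locks on a shared file, taking the next byte before
releasing the current one. This is not the real ping_pong source; the
function name and parameters are my own, and it only runs one process on
a local temp file. The point is that each lock request is a blocking
call, so a process waiting on a lock that is never released just sits in
the kernel.

```python
import fcntl
import os
import tempfile


def ping_pong(path, num_locks, iterations):
    """Walk a ring of single-byte POSIX locks; return completed cycles."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Start by holding the lock on byte 0 (blocking request).
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0, os.SEEK_SET)
        current = 0
        count = 0
        for _ in range(iterations):
            nxt = (current + 1) % num_locks
            # Take the next byte first; this blocks while another
            # process holds it.
            fcntl.lockf(fd, fcntl.LOCK_EX, 1, nxt, os.SEEK_SET)
            # Then release the byte we held until now.
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, current, os.SEEK_SET)
            current = nxt
            count += 1
        return count
    finally:
        os.close(fd)  # closing the descriptor drops any remaining locks


# Single-process run on a local temp file (assumed values: 3 locks,
# 1000 cycles). On a cluster, every node would point at the same file.
with tempfile.NamedTemporaryFile() as tmp:
    cycles = ping_pong(tmp.name, num_locks=3, iterations=1000)
print("completed lock cycles:", cycles)
```

With a single process nothing ever contends, so it finishes immediately;
the interesting behaviour only shows up with several nodes on the same
file.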

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org