[Linux-cluster] Known limits ?

Fri Jul 21 11:01:12 UTC 2006

Patrick Caulfield wrote:

>Mathieu Avila wrote:
>  
>
>>Hello GFS team,
>>
>>I'm trying to run a GFS filesystem on ~32 nodes, but there is a problem
>>when i start the daemons.
>>My nodes are called sam38 -> sam70.
>>- I run ccsd on all nodes (using init script) at the same time, and it's
>>ok.
>>- I run cman on all nodes (using init script) at the same time, and it's
>>ok. "cman_tool nodes" tells me all nodes have rejoined the cluster.
>>- I run fenced on all nodes, one by one every second, and it fails from
>>sam57 to sam70.
>>
>>From the last one that succeeds (sam56), i see :
>>[root@sam56 ~]# cat /proc/cluster/services
>>Service          Name                              GID LID State     Code
>>Fence Domain:    "default"                           1   2 recover 4 -
>>[21 19 9 8 7 1 2 3 4 6 11 13 16 23 26 27 28 32 33 25 20 14]
>>
>>Then, at sam57 , i get in /var/log/messages:
>>Jul 20 18:38:17 sam57 kernel: CMAN: got WAIT barrier not in phase 1
>>TRANSITION.44 (2)
>>
>>when trying to run fenced.
>>
>>Then I run /etc/init.d/fenced stop, and I get:
>>Jul 20 18:47:44 sam57 fenced[28722]: process_events: service leave failed
>>Jul 20 18:47:44 sam57 fenced: shutdown succeeded
>>
>>
>>When i start it again:
>>Jul 20 18:47:45 sam57 fenced[28964]: fence_domain_add: service set level
>>failed
>>
>>
>>After this step, i did stop everything on sam38 (fenced/ccsd/cman), to
>>see whether getting one node out would let me get another one in, but i
>>got this strange message on sam39:
>>Jul 20 19:14:04 sam39 kernel: CMAN: node sam38 has been removed from the
>>cluster : No response to messages
>>Jul 20 19:14:12 sam39 kernel: SM: 00000001 process_recovery_barrier
>>status=-104
>>
>>
>>In a previous experience, running fenced on all nodes at the same time led to a
>>global cluster failure (nothing responded): this is why I tried
>>to run them one by one.
>>
>>
>>All nodes are 64-bit, and I use the latest "cluster" code from the STABLE
>>branch of CVS. My configuration file is standard and works for a
>>few nodes (tested many times on 5 nodes with no problem), except for this:
>><fence_daemon post_join_delay="30"></fence_daemon>
>>
>>Why does the fenced daemon fail to start where cman succeeded? I thought
>>it was just a service like any other, built on top of CMAN.
>>Also, what are the known limits of the cluster infrastructure, in terms
>>of nodes ?
>>
>>Do you have any advice on how to fix this problem? In particular, at
>>this point, does choosing DLM instead of Gulm change anything?
>>(No, I think, but I'd rather be sure.)
>>    
>>
>
>You do seem to have hit a limit.
>
>Personally I've only tested up to 31 nodes, and recently someone posted to this
>list with a similar problem on 38 nodes. However, when I looked at the logs
>it actually seemed to fall over at 32!
>
>There's nothing hard-coded in cman to limit the number of nodes to that amount,
>and I can't find anything obvious that should cause it to happen.
>
>In the meantime, all I can suggest is that you use gulm for clusters with >= 32 nodes.
>  
>

I had the problem with 38 nodes. Using gulmd it works without any problem.

Matteo
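
A minimal Python sketch for counting how many node IDs appear in the fence
domain entry of /proc/cluster/services, which may help when checking where
fenced stops admitting nodes. It assumes the file has the layout quoted
above (a bracketed member list, possibly wrapped onto the next line). The
function name fence_domain_members and the SAMPLE text are illustrative and
not part of the cluster tools.

```python
import re


def fence_domain_members(services_text, domain="default"):
    """Return the node IDs listed for a fence domain.

    Assumption: the text looks like the /proc/cluster/services excerpt
    quoted in this thread, with the member list in square brackets.
    """
    pattern = re.compile(
        r'Fence Domain:\s+"' + re.escape(domain) + r'".*?\[([\d\s]*)\]',
        re.DOTALL,  # the member list may be wrapped onto a following line
    )
    match = pattern.search(services_text)
    if match is None:
        return []
    return [int(token) for token in match.group(1).split()]


# Sample copied from the sam56 output quoted earlier in the thread.
SAMPLE = """\
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[21 19 9 8 7 1 2 3 4 6 11 13 16 23 26 27 28 32 33 25 20 14]
"""

members = fence_domain_members(SAMPLE)
print(len(members), "members, highest node ID", max(members))
```

On a live node the text would come from reading /proc/cluster/services
instead of SAMPLE.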