[Linux-cluster] 32 nodes limit?

Marc Aurele La France tsi at ualberta.ca
Tue Sep 23 16:01:07 UTC 2008


On Mon, 15 Sep 2008, Christine Caulfield wrote:
> Marc Aurele La France wrote:
>> I'm trying to move a 5TB filespace from NFS to GFS2.  I have a P4 (the
>> current NFS server) and 33 Opteron nodes, all running a stock 2.6.22
>> kernel, OpenAIS 0.80.3, and a 2.00.00 cluster suite.  For now, I've
>> dummied out fencing and set expected_votes to 1.  I can start/stop cman
>> on all nodes without problems.  With cman running everywhere, I've formatted,
>> mounted and populated the filesystem using the P4.  Proceeding through
>> the Opterons to mount the filesystem succeeds until the 32nd node, at
>> which point mount.gfs2 hangs (in "D" according to `ps ax`).  Going back,
>> the first 16 systems that have mounted the filesystem can still `ls` the
>> top level directory, but attempts to do so on the remaining systems also
>> get stuck in "D".  Any attempt to unmount the filesystem throws the
>> entire setup into "D".

>> Due to various considerations, moving to more recent versions is not the
>> preferred option at this point.  Hence my question.

> CMAN/openais in RHEL5 seems to be happy up to around 48 nodes (again
> this is not a QE figure, it's something we have tested in development
> only) with appropriate tuning. If you are seeing problems then it might
> be helpful to adjust some of the timeouts used in the openais totem
> protocol. man openais.conf will tell you something about them. Before
> doing this though it's worth checking the output of the "group_tool" command
> and syslog to see if there are any openais or other daemon errors that
> might be causing your problems. If necessary post them to this list.

> It's also worth mentioning that 2.00.00 has had a considerable number of
> bugfixes applied since it was released and the current version is
> 2.03.07. I do strongly recommend you upgrade to this version even though
> you say it is not "the preferred option at this point".

> I hope this helps,

It most certainly does.  Thanks for the hint.  It turns out I had 
neglected to copy over my openais configuration from a test cluster. 
Everything seems to work now.
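
A configuration file that is missing or stale on some nodes is easy to check for mechanically. The following is only a minimal sketch, not something used in this thread: it assumes the contents of each node's openais configuration have already been collected by some other means (how is left open), and the node names and file contents are hypothetical placeholders. It reports the nodes whose copy is absent or differs from the most common one.

```python
import hashlib
from collections import Counter


def config_digest(data):
    """Return the SHA-256 hex digest of a configuration file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_outliers(configs):
    """Return a sorted list of nodes whose config is missing or unusual.

    `configs` maps a node name to the file contents as bytes, or to None
    when the file does not exist on that node.  A node is an outlier if
    its file is missing or its digest differs from the most common digest.
    """
    digests = {
        node: (config_digest(data) if data is not None else None)
        for node, data in configs.items()
    }
    present = [d for d in digests.values() if d is not None]
    if not present:
        # No node has the file at all, so every node is suspect.
        return sorted(digests)
    majority, _count = Counter(present).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority)


if __name__ == "__main__":
    # Hypothetical node names and placeholder contents, for illustration only.
    tuned = b"# tuned totem settings (placeholder)\n"
    sample = {"node01": tuned, "node02": tuned, "node03": None}
    print(find_outliers(sample))
```

The comparison is by majority, so it only helps when most nodes already carry the intended file; comparing against a known-good reference copy would be the stricter variant.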

FWIW, upgrading to the latest-and-greatest versions is not the preferred
option at this point, primarily because of the PITFA the kernel invariably
creates with its incompatible changes to internal APIs.  I have a number of
external additions to deal with, and not all of them are likely to have
been ported to the latest kernels.

Anyway, sorry for the noise, but thanks for your time.  Much appreciated.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi at ualberta.ca       |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+



