[Linux-cluster] cluster failed after 53 hours

Marcelo Matus mmatus at dinha.acms.arizona.edu
Fri Jan 21 23:54:42 UTC 2005

We also see crashes when writing very large files (5GB or so). The
problem seems to occur when we hit the GFS cache limit on a machine
with 4GB of memory (dual Opteron).

Is there a way to tune the GFS cache to use less memory, say a maximum
of 512MB, so we can debug the problem better?

The cause seems to be either the remote GFS cache or GNBD, since we can
write 8GB or larger files when GFS is mounted locally, i.e., when we run
the tests on the same node that exports the GFS device, via GNBD, to the
rest of the nodes.
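
For illustration, here is a minimal sketch of a large sequential write of
the sort described above. It is not the test program used here. The
function name, the chunk size, and the mount path in the usage note are
hypothetical. It only assumes that writing a file bigger than the node's
memory, in fixed-size chunks, is enough to exercise the cache.

```python
import os


def write_large_file(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of zeros to path in chunk_bytes pieces.

    Returns the number of bytes written. The data is flushed and
    fsync'ed before returning so it is pushed out to the filesystem
    rather than left in user-space buffers.
    """
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            # The last chunk may be shorter than chunk_bytes.
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())
    return written
```

Something like write_large_file("/mnt/gfs/bigfile", 5 * 1024**3) on a
client node, and the same call on the exporting node, would compare the
remote and local cases.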


Patrick Caulfield wrote:

>On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
>>My 3 node cluster ran tests for 53 hours before hitting a problem.
>Attached is a patch to set the CMAN process to run at realtime priority, I'm not
>sure if that's the right thing to do or not to be honest.
>Neither am I sure whether your 48-53 hours is significant - it's possible that
>memory may be an issue (only guessing, but GFS caches locks like crazy). It may be
>worth cutting this down a bit by tweaking
>/proc/cluster/lock_dlm/drop_count    and/or
>Otherwise, the only way we're going to get to the bottom of this is to enable
>"DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked
>out of the cluster.
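
The drop_count tweak suggested in the quoted message could be scripted
roughly as follows. This is a sketch under assumptions, not a documented
interface: it assumes the /proc file holds a single decimal integer and
accepts a new value written to it as root. The helper names are
hypothetical, and the thread does not establish whether a change takes
effect on an already mounted filesystem.

```python
from pathlib import Path

# Path taken from the quoted message. Its exact format is an assumption.
DROP_COUNT = Path("/proc/cluster/lock_dlm/drop_count")


def read_drop_count(path=DROP_COUNT):
    """Return the current value, assuming a single integer in the file."""
    return int(Path(path).read_text().strip())


def set_drop_count(value, path=DROP_COUNT):
    """Write a new non-negative value (normally requires root)."""
    if value < 0:
        raise ValueError("drop_count must be non-negative")
    Path(path).write_text(f"{value}\n")
```

Reading the value first and recording it makes it easy to restore the
old setting after a test run.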
