[Linux-cluster] Problem in clvmd/dlm_recoverd

Tue Nov 18 15:36:19 UTC 2008

On Tue, Nov 18, 2008 at 05:14:38PM +1030, Tom Lanyon wrote:
> On 15/11/2008, at 8:35 AM, David Teigland wrote:
> 
> >On Fri, Nov 14, 2008 at 09:53:13PM +0000, Nuno Fernandes wrote:
> >>>On Fri, Nov 14, 2008 at 10:00:13AM +0000, Nuno Fernandes wrote:
> >>>dlm recovery appears to be stuck; this is usually due to a problem at the
> >>>network level.  The recovery seems to be caused by a node starting clvmd.
> >>Hi,
> >>
> >>I don't know if it helps, but groupd is using all available CPU, but
> >>only in 2 of the nodes.
> >
> >That sounds like https://bugzilla.redhat.com/show_bug.cgi?id=444529
> >which is fixed in 5.3.  I suspect that's the cause of your problems.
> >
> >Dave
> 
> 
> We seem to be having the same problem on a 5 node virtual cluster  
> where 3 of the nodes share a GFS mount.
> 
> A backup script runs on one node which does some heavy reads + writes  
> to this mount at which point all three nodes jump to 100% cpu (90%  
> iowait on the machine that is doing the backup, 100% system on the  
> other two) and all LVM VGs, LVs and GFS mounts lock up.

Which process was using 100% cpu?  If it was groupd, fenced, dlm_controld
or gfs_controld, then yes it may be the same problem.
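
One way to answer that question on each node is sketched below. It is only an
illustration, not something from the original thread: it assumes a procps-style
`ps` that accepts `-eo pid,pcpu,comm`, and the helper name
`busy_cluster_daemons` and the 50% threshold are made up for the example. Note
that `ps` reports CPU usage averaged over the life of the process, so `top` may
be a better guide for a short spike.

```python
import subprocess

# Daemons named in the reply above as possible signs of the same bug.
CLUSTER_DAEMONS = {"groupd", "fenced", "dlm_controld", "gfs_controld"}


def busy_cluster_daemons(ps_output, threshold=50.0):
    """Return (command, pid, cpu) tuples for the listed daemons whose
    %CPU is at or above `threshold`, busiest first.

    `ps_output` is expected to be the text of `ps -eo pid,pcpu,comm`.
    The threshold is an arbitrary value chosen for this sketch.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and commands containing spaces
        pid, cpu, comm = fields
        try:
            cpu_pct = float(cpu)
        except ValueError:
            continue  # header line
        if comm in CLUSTER_DAEMONS and cpu_pct >= threshold:
            hits.append((comm, int(pid), cpu_pct))
    return sorted(hits, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for comm, pid, cpu in busy_cluster_daemons(out):
        print(f"{comm} (pid {pid}): {cpu:.1f}% CPU")
```

If this prints nothing while the node is pegged, the load is coming from some
other process and the bug referenced above is a less likely explanation.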

> Is there anything that could be tuned here to avoid this issue until a  
> bug fix is released?

I don't think there's any way to avoid the bug in the bz I referenced.

Dave