[Linux-cluster] Problem in clvmd/dlm_recoverd

Wed Nov 19 04:30:33 UTC 2008

On 19/11/2008, at 2:06 AM, David Teigland wrote:

> On Tue, Nov 18, 2008 at 05:14:38PM +1030, Tom Lanyon wrote:
>> We seem to be having the same problem on a 5 node virtual cluster
>> where 3 of the nodes share a GFS mount.
>>
>> A backup script runs on one node which does some heavy reads + writes
>> to this mount at which point all three nodes jump to 100% cpu (90%
>> iowait on the machine that is doing the backup, 100% system on the
>> other two) and all LVM VGs, LVs and GFS mounts lock up.
>
> Which process was using 100% cpu?  If it was groupd, fenced, dlm_controld
> or gfs_controld, then yes it may be the same problem.
>
>> Is there anything that could be tuned here to avoid this issue until a
>> bug fix is released?
>
> I don't think there's any way to avoid the bug in the bz I referenced.
>
> Dave

We haven't been able to catch it quickly enough to determine which
process is using all the CPU.
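
One way to catch the culprit next time would be to leave a small logger
running on each node. The following is only a minimal sketch, assuming a
Linux /proc filesystem and Python 3; the function names and the 5-second
interval are illustrative choices, not part of any cluster tooling, and
a node that is completely locked up may not schedule the script at all.

```python
import os
import time


def parse_stat(line):
    """Return (pid, comm, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    # tail begins at field 3 (state); utime and stime are fields 14 and 15.
    return int(left), comm, int(fields[11]) + int(fields[12])


def snapshot():
    """Map pid -> (comm, cpu ticks) for every process that can be read."""
    procs = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                pid, comm, ticks = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited between listdir() and open()
        procs[pid] = (comm, ticks)
    return procs


def busiest(before, after, count=5):
    """Return the `count` largest (tick increase, pid, comm) tuples."""
    deltas = []
    for pid, (comm, ticks) in after.items():
        # A process absent from `before` is charged all of its ticks.
        prev = before.get(pid, (comm, 0))[1]
        deltas.append((ticks - prev, pid, comm))
    return sorted(deltas, reverse=True)[:count]


if __name__ == "__main__":
    interval = 5  # seconds between samples (arbitrary)
    previous = snapshot()
    while True:
        time.sleep(interval)
        current = snapshot()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for ticks, pid, comm in busiest(previous, current):
            print("%s pid=%d comm=%s ticks=%d" % (stamp, pid, comm, ticks), flush=True)
        previous = current
```

Redirecting the output to a file on local disk (not the GFS mount) would
leave a record of whether groupd, fenced, dlm_controld or gfs_controld
was at the top of the list when the lockup started.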

The other possibility is that we're just seeing a huge number of glocks
created on the node running backups, and all the others (webservers) are
just hanging whilst trying to access files. I've just done some fairly
aggressive tuning of the GFS mounts on all nodes; hopefully this fixes
it!

Regards,
Tom