[Linux-cluster] dlm high cpu on latest stock centos 5.1 kernel

Thu Apr 3 22:35:41 UTC 2008

Dave,

Thanks for the update. I had considered that and I'm set up to be able to
do it. Now that someone else has tried with positive results, I think I'll
give it a try.

Thanks,
-Andrew

On Thu, April 3, 2008 3:29 pm, David Ayre wrote:
> Some progress...
>
> We had another dlm_sendd lockup yesterday which prompted us to do some
> reworking of our file sharing.  Previously we had both SMB and NFS
> services competing for GFS resources on this particular node.   We
> thought perhaps it was this combination which may have provoked the
> lockups... so, we moved things around with the help of another server
> in our GFS cluster.
>
> Previously we had:
>
> Machine A (nfs and smb services sitting on top of gfs)
> NFS  SMB
> GFS
>
> And switched things around to this:
>
> Machine A
> SMB
> NFS -> Machine B
>
> Machine B
> NFS
> GFS
>
> Basically we moved all NFS mounts to Machine B, where NFS is now the
> only file sharing service using GFS, and changed Machine A to use an
> NFS mount from Machine B.   This way we don't have any nodes
> with both SMB and NFS services running on top of GFS.
>
> Previously we had 1-2 lockups a day, but today nothing... so far so
> good.   Not sure if this configuration will work for you... let me
> know if you need any further clarification.
>
> d
>
>
> On 1-Apr-08, at 5:51 PM, Andrew A. Neuschwander wrote:
>
>> My symptoms are similar. dlm_send sits on all of the cpu. Top shows the
>> cpu spending nearly all of its time in sys or interrupt handling. Disk
>> and network I/O aren't very high (as seen via iostat and iptraf), but
>> SMB/NFS throughput and latency are horrible. Context switches per second
>> as seen by vmstat are in the 20,000+ range (I don't know if this is high
>> though; I haven't really paid attention to this in the past). Nothing
>> crashes, and it is still able to serve data (very slowly), and eventually
>> the load and latency recover.
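
For anyone who wants to log these numbers over time rather than watch top
and vmstat by hand, here is a rough sketch that derives the same two figures
from /proc/stat: context switches per second and the share of CPU time spent
in system plus interrupt handling. The function names and the 5 second
interval are just my own choices for illustration, and it assumes the usual
"cpu user nice system idle iowait irq softirq ..." field order.

```python
import time


def parse_proc_stat(text):
    """Return (context_switch_count, aggregate_cpu_fields) from /proc/stat text."""
    ctxt = None
    cpu = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":          # aggregate line, not cpu0/cpu1/...
            cpu = [int(x) for x in parts[1:]]
        elif parts[0] == "ctxt":       # total context switches since boot
            ctxt = int(parts[1])
    return ctxt, cpu


def summarize(before, after, seconds):
    """Compare two parse_proc_stat() results taken `seconds` apart."""
    ctxt0, cpu0 = before
    ctxt1, cpu1 = after
    deltas = [b - a for a, b in zip(cpu0, cpu1)]
    total = sum(deltas) or 1           # avoid division by zero on idle samples
    # Assumed field order: user nice system idle iowait irq softirq ...
    kernel = deltas[2] + sum(deltas[5:7])
    return {
        "ctxt_per_sec": (ctxt1 - ctxt0) / seconds,
        "kernel_pct": 100.0 * kernel / total,
    }


def sample(path="/proc/stat", interval=5.0):
    """Take two snapshots `interval` seconds apart and summarize them."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return summarize(before, after, interval)
```

Calling sample() on the affected node during a slowdown and again during
normal load would at least give a baseline for whether 20,000+ context
switches per second is unusual for that machine.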
>>
>> As an aside, does anyone know how to _view_ the resource group size after
>> file system creation on GFS?
>>
>> Thanks,
>> -Andrew
>>
>>
>> On Tue, April 1, 2008 6:30 pm, David Ayre wrote:
>>> What do you mean by pounded exactly ?
>>>
>>> We have an ongoing issue, similar... when we have about a dozen users
>>> using both smb/nfs, and at some seemingly random point in time our
>>> dlm_sendd chews up 100% of the CPU... then dies down on its own
>>> after quite a while.  Killing SMB processes, shutting down SMB didn't
>>> seem to have any effect... only a reboot cures it.  I've seen this
>>> described (if this is the same issue) as a "soft lockup" as it does
>>> seem to come back to life:
>>>
>>> http://lkml.org/lkml/2007/10/4/137
>>>
>>> We've been assuming it's a kernel/dlm version issue as we are running
>>> 2.6.9-55.0.6.ELsmp with dlm-kernel 2.6.9-46.16.0.8
>>>
>>> we were going to try a kernel update this week... but you seem to be
>>> using a later version and still have this problem ?
>>>
>>> Could you elaborate on "getting pounded by dlm" ?  I've posted about
>>> this on this list in the past but received no assistance.
>>>
>>>
>>>
>>>
>>> On 1-Apr-08, at 5:19 PM, Andrew A. Neuschwander wrote:
>>>
>>>> I have a GFS cluster with one node serving files via smb and nfs. Under
>>>> fairly light usage (5-10 users) the cpu is getting pounded by dlm. I am
>>>> using CentOS 5.1 with the included kernel (2.6.18-53.1.14.el5). This sounds
>>>> like the dlm issue mentioned back in March of last year
>>>> (https://www.redhat.com/archives/linux-cluster/2007-March/msg00068.html)
>>>> that was resolved in 2.6.21.
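
As a quick way to check a batch of nodes against that 2.6.21 figure, a small
sketch like the following pulls the upstream base version out of a kernel
release string (the output of uname -r) and compares it. The function names
are my own, and the result is only a hint: distribution kernels carry
backported patches, so a 2.6.18-based el5 kernel may or may not already
include the fix.

```python
import re


def upstream_version(release):
    """Extract the upstream base, e.g. '2.6.18-53.1.14.el5' -> (2, 6, 18)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(g) for g in m.groups())


def predates(release, fixed_in=(2, 6, 21)):
    """True if the upstream base is older than the version carrying the fix.

    This ignores vendor backports, so True means "possibly affected",
    not "definitely affected".
    """
    return upstream_version(release) < fixed_in
```

Both kernels mentioned in this thread, 2.6.18-53.1.14.el5 and
2.6.9-55.0.6.ELsmp, have an upstream base older than 2.6.21 by this check.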
>>>>
>>>> Has this fix been (or will it be) backported to the current el5 kernel?
>>>> Will it be in RHEL5.2? What is the easiest way for me to get this fix?
>>>>
>>>> Also, if I try a newer kernel on this node, will there be any harm in the
>>>> other nodes using their current kernel?
>>>>
>>>> Thanks,
>>>> -Andrew
>>>> --
>>>> Andrew A. Neuschwander, RHCE
>>>> Linux Systems Administrator
>>>> Numerical Terradynamic Simulation Group
>>>> College of Forestry and Conservation
>>>> The University of Montana
>>>> http://www.ntsg.umt.edu
>>>> andrew at ntsg.umt.edu - 406.243.6310
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> ~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~
>>> David Ayre
>>> Programmer/Analyst - Information Technology Services
>>> Emily Carr Institute of Art and Design
>>> Vancouver, B.C.   Canada
>>> 604-844-3875 /  david at eciad.ca
>
> ~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~
> David Ayre
> Programmer/Analyst - Information Technology Services
> Emily Carr Institute of Art and Design
> Vancouver, B.C.   Canada
> 604-844-3875 /  david at eciad.ca

-- 
Andrew A. Neuschwander, RHCE
Linux Systems Administrator
Numerical Terradynamic Simulation Group
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
andrew at ntsg.umt.edu - 406.243.6310