[Linux-cluster] Freeze with cluster-2.03.11

Kadlecsik Jozsef kadlec at mail.kfki.hu
Sat Mar 28 11:12:56 UTC 2009


Hi,

On Fri, 27 Mar 2009, Wendy Cheng wrote:

> > I should get some sleep - but can't it be that I hit the potential
> > deadlock mentioned here:
> >
> >       commit  4787e11dc7831f42228b89ba7726fd6f6901a1e3
> > 
> >       gfs-kmod: workaround for potential deadlock. Prefault user pages
[...] 
> >       file. In this case, prefaulting the buffer's pages immediately
> >       before acquiring the glocks significantly shortens the window
> >       for this deadlock. Closing the window any more causes a large
> >       performance hit.
> >
> >  Mailman does mmap files...

> I don't see strong evidence of a deadlock (though it could be one) in
> the thread backtraces. However, assuming the cluster worked before, you
> could have overloaded the e1000 driver in this case. There are suspicious
> page faults but memory is very "ok". So one possibility is that GFS
> generated so many sync requests that it flooded the e1000. As a result,
> the cluster heartbeat missed its interval.

It's a possibility, but it also assumes that the node freezes >because<
it was fenced off. So far nothing indicates that.

> Do you have the same ethernet card for both AOE and cluster traffic? If
> yes, separate them to see how it goes.

Yes, the AOE and cluster traffic share the same ethernet card. However,
with the earlier release, no matter how high the load was, we never saw
any lockup or freeze.

> And of course, if you don't have Ben's mmap patch (as you described in 
> your post), it is probably a good idea to get it into your gfs-kmod.

The patch *is* in cluster-2.03.11. The commit comment itself says that it
shortens the window for the deadlock but does not eliminate it.
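
For illustration only, the idea in that commit comment (touch the
buffer's pages before taking the lock, so that page faults are not taken
while the lock is held) can be sketched in userspace Python. This is an
analogy under stated assumptions, not the gfs-kmod code: the names
prefault, copy_under_lock and demo are hypothetical, and a
threading.Lock stands in for the glock.

```python
import mmap
import tempfile
import threading

PAGE_SIZE = mmap.PAGESIZE


def prefault(buf):
    """Read one byte from every page of buf so its page faults happen now."""
    touched = 0
    for offset in range(0, len(buf), PAGE_SIZE):
        _ = buf[offset]  # reading the byte forces the page to be mapped in
        touched += 1
    return touched


def copy_under_lock(lock, src, dst):
    """Hypothetical critical section: prefault src, then copy it under lock."""
    prefault(src)  # faults are taken here, before the lock is held
    with lock:     # a page could still be evicted before this line runs
        dst[:len(src)] = src[:]


def demo():
    """Map a four-page temporary file and copy it using the pattern above."""
    lock = threading.Lock()
    size = 4 * PAGE_SIZE
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * size)
        f.flush()
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as src:
            dst = bytearray(size)
            pages = prefault(src)
            copy_under_lock(lock, src, dst)
            return pages, bytes(dst)
```

As in the commit comment, this only narrows the window: nothing pins the
pages, so one of them can still be evicted between prefault() and the
lock acquisition.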

As a possible workaround I moved mailman from GFS to a local disk, 
started it and there was no freeze. The cluster ran for almost seven 
hours, then two nodes died again :-(

> But honestly,  I think running GFS1 on newer kernels is a bad idea.

I see. So do you believe GFS2 is better/ready for production?

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

