[Linux-cluster] GFS2: processes stuck in "just schedule"
swhiteho at redhat.com
Fri Dec 4 09:39:01 UTC 2009
On Thu, 2009-12-03 at 17:30 -0500, Allen Belletti wrote:
> Hi All,
> After Steve and the RedHat guys dug into my nasty crashdump (thanks
> all!), I believe I'm down to the last GFS2 problem on our mail cluster,
> but it's a common one.
> I've always had trouble with processes getting stuck on GFS2 access and
> queuing up. Since the 5.4 upgrade and moving to the proper GFS2 kernel
> module, it's changed but not gone away. Every few days now, I'm seeing
> processes getting stuck with WCHAN=just_schedule. Once this starts
> happening, both cluster nodes will accumulate them rapidly, which
> eventually brings IO to a halt. The only way I've found to escape is
> via a reboot, sometimes of one, sometimes of both nodes.
> Since there's no crash, I don't get any useful debug information.
> Outside of this one repeating glitch, performance is great and all is
> well. If anyone can suggest ways of gathering more data about the
> problem, or possible solutions, I would be grateful.
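
[Editorial sketch, not part of the original message.] One low-cost way to
gather data without a crash is to record which processes are sitting in
WCHAN=just_schedule on each node. The helper below is a minimal sketch that
assumes the text comes from `ps -eo pid,wchan:32,comm`; the function name
stuck_pids is hypothetical and chosen only for illustration.

```python
def stuck_pids(ps_output, wchan="just_schedule"):
    """Return (pid, command) pairs whose WCHAN column equals `wchan`.

    Assumes three whitespace-separated columns: PID, WCHAN, COMMAND,
    as printed by `ps -eo pid,wchan:32,comm`.
    """
    stuck = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 2)
        # Skip blank lines and the header row (its first field is not a number).
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        if fields[1] == wchan:
            stuck.append((int(fields[0]), fields[2]))
    return stuck
```

Capturing this list periodically on both nodes should show when the pile-up
starts and which commands are affected first.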
This would be typical for what happens when there is contention on a
glock between two (or more) nodes. There is a mechanism which is
supposed to try and mitigate the issue (by allowing each node to hold on
to a glock for a minimum period of time which is designed to ensure that
some work is done each time a node acquires a glock) but if your storage
is particularly slow, and/or possibly depending upon the exact I/O
pattern, it may not always be 100% effective.
In the first instance though, see if you can find an inode which is
being contended from both nodes as that will most likely be the culprit.
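
[Editorial sketch, not part of the original message.] The search for a
contended inode can be roughly automated by scanning a glock dump from each
node. The code below is a sketch under several assumptions that should be
checked against the running kernel: that the dump is read from debugfs
(typically /sys/kernel/debug/gfs2/<fsname>/glocks), that glock lines begin
with "G:" and carry an "n:<type>/<number>" field with the number in hex, that
type 2 denotes an inode glock, and that holder lines begin with "H:" with a
"W" in the "f:" flags for a request that is still waiting. The function name
contended_inodes is hypothetical.

```python
import re

def contended_inodes(glock_dump):
    """Return {inode_number: [waiting pids]} for inode glocks with waiters."""
    waiters = {}
    current = None  # inode number of the glock being read, or None
    for line in glock_dump.splitlines():
        line = line.strip()
        if line.startswith("G:"):
            # Assumed field layout: n:<type>/<number in hex>; type 2 = inode.
            m = re.search(r"\bn:(\d+)/([0-9a-fA-F]+)", line)
            is_inode = m is not None and m.group(1) == "2"
            current = int(m.group(2), 16) if is_inode else None
        elif line.startswith("H:") and current is not None:
            flags = re.search(r"\bf:(\S*)", line)
            pid = re.search(r"\bp:(\d+)", line)
            # Assumed: a "W" flag marks a holder still waiting for the lock.
            if flags and pid and "W" in flags.group(1):
                waiters.setdefault(current, []).append(int(pid.group(1)))
    return waiters
```

Running this against the dump from each node and intersecting the keys, for
example set(node_a) & set(node_b), gives a short list of inode numbers that
have waiters on both nodes at once.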