[Linux-cluster] Clearing a glock

Scooter Morris scooter at cgl.ucsf.edu
Tue Jul 27 17:14:47 UTC 2010


  Hi Steve,
     More information.  The offending file was /usr/local/bin/python2.6, 
which we use heavily on all nodes.  Our general use is through the #! 
mechanism in .py files.  Does this offer any clues as to why we had all 
of those processes waiting on a lock with no holder?
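
For the record, here's a quick sanity check of the match (a minimal sketch,
assuming the a5b67f value from the glock dump is the inode number in hex, as
Steve suggested below):

    #!/usr/bin/env python
    # Check whether a candidate file matches the inode number behind a glock.
    # Assumes the glock dump reports the inode number in hex (e.g. "a5b67f").
    import os

    glock_hex = "a5b67f"
    inum = int(glock_hex, 16)          # 0xa5b67f == 10860159 decimal

    path = "/usr/local/bin/python2.6"  # the candidate file in our case
    st = os.stat(path)
    if st.st_ino == inum:
        print("%s is inode %d (glock %s)" % (path, inum, glock_hex))
    else:
        print("%s is inode %d, not %d" % (path, st.st_ino, inum))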

-- scooter

On 07/27/2010 06:18 AM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2010-07-27 at 05:57 -0700, Scooter Morris wrote:
>> On 7/27/10 5:15 AM, Steven Whitehouse wrote:
>>> Hi,
>>>
>>> If you translate a5b67f into decimal, then that is the inode number of
>>> the inode which is causing a problem. It looks to me as if you have too
>>> many processes trying to access this one inode from multiple nodes.
>>>
>>> It's not obvious from the traces that anything is actually stuck, but if
>>> you take two traces, a few seconds or minutes apart, then it should
>>> become more obvious whether the cluster is making progress or whether it
>>> really is stuck,
>>>
>>> Steve.
>>>
>>>
>> Hi Steve,
>>       As always, thanks for the reply.  The cluster was, indeed, truly
>> stuck.  I rebooted it last night to clear everything out.  I never did
>> figure out which file was the problem.  I did a find -inum, but the find
>> hung too.  By that point the load average was up to 80 and climbing.
>> Any ideas on how to avoid this?  Are there tunable values I need to
>> increase to allow more processes to access any individual inode?
>>
> The LA includes processes waiting for glocks, since that is an
> uninterruptible wait, so that's where most of the LA came from.
>
> The find is unlikely to work while the cluster is stuck, since if it
> does find the culprit inode, it is, by definition, already stuck, so the
> find process would just join the queue. If a find fails to discover the
> inode when the cluster has been rebooted and is back working again, then
> it was probably a temporary file of some kind.
>
> There are no tunable values, since the limit on access to the inode is
> the speed of the hardware: how many times a given inode can be synced,
> invalidated and its glock passed on to another node
> in a given time period. It is a limitation of the hardware and the
> architecture of the filesystem.
>
> There are a few things which can probably be improved in due course, but
> in the main the best way to avoid problems with congestion on inodes is
> just to be careful about the access pattern across nodes.
>
> That said, if it really was completely stuck, that is a real bug and not
> the result of the access pattern since the code is designed such that
> progress should always be made, even if it's painfully slow,
>
> Steve.
>
>
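
P.S. For the archives: a rough Python equivalent of the "find -inum" lookup
Steve describes above, to run only once the cluster is back and healthy. The
mount point here is just a placeholder, and the hex value is whatever the
glock dump reports:

    #!/usr/bin/env python
    # Walk a GFS2 mount looking for the file that owns a given inode number,
    # i.e. roughly "find <mountpoint> -inum <n>".  Placeholder values below.
    import os

    mountpoint = "/mnt/gfs2"           # placeholder: the GFS2 mount to search
    inum = int("a5b67f", 16)           # inode number from the glock dump (hex)

    for root, dirs, files in os.walk(mountpoint):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_ino == inum:
                    print(path)
            except OSError:
                pass                   # entry may have vanished mid-walk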



