[Linux-cluster] Clearing a glock

Scooter Morris scooter at cgl.ucsf.edu
Tue Jul 27 17:45:33 UTC 2010


  On 07/27/2010 10:35 AM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2010-07-27 at 10:14 -0700, Scooter Morris wrote:
>> Hi Steve,
>>       More information.  The offending file was /usr/local/bin/python2.6,
>> which we use heavily on all nodes.  Our general use is through the #!
>> mechanism in .py files.  Does this offer any clues as to why we had all
>> of those processes waiting on a lock with no holder?
>>
>> -- scooter
>>
> Not really. I'd have expected that to be mapped read-only on the nodes,
> with no write activity to it at all, so it should scale very well. Did
> you set noatime?
Yes.
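
For reference, the relevant fstab entry looks something like this (the
device and mount point here are illustrative, not our actual ones):

    /dev/cluster_vg/usrlocal_lv  /usr/local  gfs2  noatime  0 0
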
> I can't think of any other reason why that should have been an issue,
Neither could I.  Well, we'll let it ride for now, but if it repeats, I'll 
file a bug and open a case with Red Hat support (and move the binary off 
gfs2).

-- scooter
> Steve.
>
>> On 07/27/2010 06:18 AM, Steven Whitehouse wrote:
>>> Hi,
>>>
>>> On Tue, 2010-07-27 at 05:57 -0700, Scooter Morris wrote:
>>>> On 7/27/10 5:15 AM, Steven Whitehouse wrote:
>>>>> Hi,
>>>>>
>>>>> If you translate a5b67f into decimal, that is the number of the inode
>>>>> which is causing the problem. It looks to me as if you have too many
>>>>> processes trying to access this one inode from multiple nodes.
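>>>>>
>>>>> For example, a quick sketch of the conversion (the mount point below
>>>>> is illustrative; substitute your own):
>>>>>
>>>>>     # Convert the glock number (hex) into the decimal inode number.
>>>>>     inum = int("a5b67f", 16)
>>>>>     print(inum)   # 10860159
>>>>>     # Then, on one node: find /mnt/gfs2 -inum 10860159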
>>>>>
>>>>> It's not obvious from the traces that anything is actually stuck, but if
>>>>> you take two traces, a few seconds or minutes apart, then it should
>>>>> become more obvious whether the cluster is making progress or whether it
>>>>> really is stuck,
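>>>>>
>>>>> Something like this (an untested sketch; substitute your own
>>>>> clustername:fsname in the debugfs path, and note that debugfs must
>>>>> be mounted) can take two snapshots and show whether anything changes:
>>>>>
>>>>>     import time
>>>>>
>>>>>     # Glock state for a gfs2 filesystem, exported via debugfs.
>>>>>     PATH = "/sys/kernel/debug/gfs2/mycluster:myfs/glocks"
>>>>>
>>>>>     def snapshot():
>>>>>         with open(PATH) as f:
>>>>>             return f.read()
>>>>>
>>>>>     before = snapshot()
>>>>>     time.sleep(30)
>>>>>     after = snapshot()
>>>>>
>>>>>     # Identical dumps mean the holder/waiter queues have not moved
>>>>>     # at all in 30 seconds; differing dumps mean (slow) progress.
>>>>>     print("no change (stuck?)" if before == after else "changing")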
>>>>>
>>>>> Steve.
>>>>>
>>>> Hi Steve,
>>>>        As always, thanks for the reply.  The cluster was, indeed, truly
>>>> stuck.  I rebooted it last night to clear everything out.  I never did
>>>> figure out which file was the problem.  I did a find -inum, but the find
>>>> hung too.  By that point the load average was up to 80 and climbing.
>>>> Any ideas on how to avoid this?  Are there tunable values I need to
>>>> increase to allow more processes to access any individual inode?
>>>>
>>> The LA includes processes waiting for glocks, since that is an
>>> uninterruptible wait, so that's where most of the LA came from.
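>>>
>>> As a rough check, something like this (a sketch; Linux reports state
>>> "D" for uninterruptible sleep) counts the processes feeding the LA:
>>>
>>>     import glob
>>>
>>>     count = 0
>>>     for path in glob.glob("/proc/[0-9]*/stat"):
>>>         try:
>>>             with open(path) as f:
>>>                 # Format is "pid (comm) state ...", and comm may
>>>                 # contain spaces, so split on the last ")".
>>>                 state = f.read().rsplit(")", 1)[1].split()[0]
>>>         except IOError:
>>>             continue  # process exited while we were looking
>>>         if state == "D":
>>>             count += 1
>>>     print("%d processes in uninterruptible sleep" % count)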
>>>
>>> The find is unlikely to work while the cluster is stuck, since if it
>>> does find the culprit inode, that inode is, by definition, already
>>> stuck, so the find process would just join the queue. If a find fails
>>> to discover the inode once the cluster has been rebooted and is back
>>> working again, then it was probably a temporary file of some kind.
>>>
>>> There are no tunable values, since the limit on access to the inode is
>>> the speed of the hardware, in terms of how many times a given inode can
>>> be synced, invalidated, and its glock passed on to another node in a
>>> given time period. That is a limitation of the hardware and the
>>> architecture of the filesystem.
>>>
>>> There are a few things which can probably be improved in due course, but
>>> in the main the best way to avoid problems with congestion on inodes is
>>> just to be careful about the access pattern across nodes.
>>>
>>> That said, if it really was completely stuck, that is a real bug and not
>>> the result of the access pattern, since the code is designed such that
>>> progress should always be made, even if it's painfully slow,
>>>
>>> Steve.