[Linux-cluster] Question about GFS2 and mmap

Scooter Morris scooter at cgl.ucsf.edu
Tue Jan 18 18:59:51 UTC 2011


Hi Steven,

As near as I can tell, the access times aren't getting updated, but when 
I run BLAST on a second node, it also gets an EX lock on the inode.  On 
the other hand, if I run them simultaneously, they both get shared (SH) 
locks (!?!):

node1:
lsof:
blastp    21730    root  mem    REG  253,8   46265880 88699363 /databases/mol/blast/db_current/nr.00.pin

lockdump:
G:  s:SH n:2/54971e3 f:q t:SH d:EX/0 l:0 a:0 r:3
  I: n:1055306/88699363 t:8 f:0x10 d:0x00000000 s:46265880/46265880

node2:
lsof:
blastp  18052 root  mem    REG  253,0   46265880 88699363 /databases/mol/blast/db_current/nr.00.pin

lockdump:
G:  s:SH n:2/54971e3 f:q t:SH d:EX/0 l:0 a:0 r:3
  I: n:1055306/88699363 t:8 f:0x10 d:0x00000000 s:46265880/46265880

It seems like mmap() proactively grabs an exclusive lock (perhaps in 
preparation to update the atime?) if it can.  The problem with this 
strategy is that when the lock gets demoted from an exclusive lock to 
a shared lock, any cached pages appear to be flushed, resulting in a 
significant drop in performance on repeated runs.  The biggest issue 
from our perspective is that apparent flush of the cached pages, since 
it has a huge impact on the performance of this particular application.
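
The access pattern boils down to something like the small test program 
below (just a sketch; the path is an example, and the per-page read is 
a stand-in for whatever BLAST actually does with the index).  Running 
it twice back-to-back on one node should show the cached case; running 
it on a second node in between should show the flush:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1]
	                            : "/databases/mol/blast/db_current/nr.00.pin";
	struct stat st;
	unsigned long sum = 0;
	double t0;
	off_t i;
	char *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Fault in one byte per page: fast while the pages stay cached,
	 * slow when they have to be read back in. */
	t0 = now();
	for (i = 0; i < st.st_size; i += 4096)
		sum += (unsigned char)p[i];
	printf("%s: %.3fs (sum %lu)\n", path, now() - t0, sum);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}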

Thanks again for your help on this!

-- scooter


On 01/18/2011 03:41 AM, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2011-01-17 at 11:06 -0800, Scooter Morris wrote:
>> Steven,
>>       Thanks for getting back to me.  Yes, I've checked and noatime is
>> definitely set.  While blast was running, I did a lockdump and the
>> mmaped files had EX locks on them:
>>
>> G:  s:EX n:2/5497229 f:q t:EX d:EX/0 l:0 a:0 r:3
>>    I: n:1055314/88699433 t:8 f:0x10 d:0x00000000 s:55237024/55237024
>>
>> where inode 88699433 is one of the mapped files:
>>
>> [root at crick blast]# ls -li /databases/mol/blast/db_current/nr.01.pin
>> 88699433 -rw-r--r-- 1 rpcuser sacs 55237024 Jan 17 02:53
>> /databases/mol/blast/db_current/nr.01.pin
>>
>> so that explains the behavior.  What I don't understand is why they had
>> EX locks.  I ran strace on blast, and what I see when the files are
>> mmap()ed is something like:
>>
>> stat("/databases/mol/blast/db/nr.01.pin", {st_mode=S_IFREG|0644,
>> st_size=55237024, ...}) = 0
>> open("/databases/mol/blast/db/nr.01.pin", O_RDONLY) = 8
>> mmap(NULL, 55237024, PROT_READ, MAP_SHARED, 8, 0) = 0x2b9ec1a14000
>>
>> where /databases/mol/blast is the gfs2 filesystem.  So the files are
>> not opened read/write, and the mmap()ed segment is not read/write.  It's
>> not clear why gfs2 would create an exclusive glock for this file.  Does
>> this make any sense to you?
>>
>> -- scooter
>>
> Well it depends on the history of the glock in question. Once a glock
> has been cached in exclusive mode, it will never be dropped unless (a)
> there is memory pressure and the glock is reclaimed or (b) another node
> requests it.
>
> So if there has been only a single node doing read only work on the
> inode and previous to that something on that same node wrote to that
> inode, then it will still be in exclusive mode. A read only request from
> a different node should push the glock into shared mode (on both nodes)
> and if that isn't happening correctly, then it sounds like something has
> gone awry somewhere.
>
> The other thing is that mmap() itself does grab an exclusive lock on
> the inode - it has to update atime if that is turned on (but it doesn't
> take any locks if atime is turned off, as in your case). Note that this
> is only the initial call to mmap() though, and the locks taken during
> page faults are either shared or exclusive according to the type of
> fault (read or write).
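>
> With a shared, writable mapping, the distinction looks roughly like
> this (a sketch; assume fd is an open descriptor on a gfs2 file):
>
>   char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>   char c = p[0];   /* read fault  -> shared lock */
>   p[0] = c;        /* write fault -> exclusive lock, via ->page_mkwrite() */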
>
> If you could check the atime before the mmap and after it, that should
> tell us if there is a problem with the noatime check here.
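>
> Something along these lines would do it (sketch only; error checking
> omitted, and path is the mmap()ed file):
>
>   struct stat before, after;
>   stat(path, &before);
>   fd = open(path, O_RDONLY);
>   p = mmap(NULL, before.st_size, PROT_READ, MAP_SHARED, fd, 0);
>   stat(path, &after);
>   printf("atime %s\n",
>          before.st_atime == after.st_atime ? "unchanged" : "updated");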
>
> If that still doesn't work, I'll try and duplicate what you are doing
> with a small test program and see if I can reproduce the problem,
>
> Steve.
>
>
>> On 01/16/2011 07:32 AM, Steven Whitehouse wrote:
>>> Hi,
>>>
>>> On Sat, 2011-01-15 at 16:46 -0800, Scooter Morris wrote:
>>>> We have a RedHat cluster (5.5 currently) with 3 nodes, and are sharing a
>>>> number of gfs2 filesystems across all nodes.  One of the applications we
>>>> run is a standard bioinformatics application called BLAST that searches
>>>> large indexed files to find similar DNA (or protein) sequences.  BLAST
>>>> will typically mmap a fair amount of data into memory from the index
>>>> files.  Normally, this significantly speeds up subsequent executions of
>>>> BLAST.  This doesn't appear to work on gfs2, however, when I involve
>>>> other nodes.  For example, if I run blast three times on a single node,
>>>> the first execution is very slow, but subsequent executions are
>>>> significantly quicker.  If I then run it on another node in the cluster
>>>> (accessing the same data files over gfs2), the first execution is slow,
>>>> and subsequent executions are quicker.  This makes sense.  The problem
>>>> is that when I run it on multiple nodes, subsequent runs on the same
>>>> node are no quicker.  It almost seems as if gfs2 is flushing
>>>> the in-memory copy (which is read only) immediately when the file is
>>>> accessed on another node.  Is this the case?  If so, is there a reason
>>>> for this, or is it a bug?  If it's a known bug, is there a workaround?
>>>>
>>>> Any help would be appreciated!  This is a critical application for us.
>>>>
>>>> Thanks in advance,
>>>>
>>>> -- scooter
>>>>
>>> Are you sure that the noatime mount option has been used? I can't figure
>>> out why that shouldn't work if the BLAST processes are really only
>>> reading the files and not writing to them.
>>>
>>> GFS2 is able to tell the difference between read and write accesses to
>>> shared, writable mmap()ed files (unlike GFS which has to assume that all
>>> accesses are write accesses). Some early versions of GFS2 did that too
>>> (i.e. treated every access as a write), but anything recent (anything
>>> with ->page_mkwrite() in the source), and certainly 5.5, should be ok.
>>>
>>> You can use the glock dump to see what mode the glock associated with
>>> the mmap()ed inode is in. With RHEL6/Fedora/upstream you can use the
>>> tracepoints to watch the state dynamically during the operations. I'm
>>> afraid that isn't available on RHEL5. All you need to know is the inode
>>> number of the file in question and then look for a type 2 glock with the
>>> same number.
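>>> Note that the dump shows the number in hex, so given st from a stat()
>>> of the file, something like
>>>
>>>   printf("n:2/%llx\n", (unsigned long long)st.st_ino);
>>>
>>> prints the exact pattern to search for.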
>>>
>>> Let us know if that helps narrow down the issue. BLAST is something that
>>> I'd like to see running well on GFS2,
>>>
>>> Steve.
>>>