[Linux-cluster] Question about GFS2 and mmap

Scooter Morris scooter at cgl.ucsf.edu
Mon Jan 17 19:06:53 UTC 2011


Steven,
     Thanks for getting back to me.  Yes, I've checked and noatime is 
definitely set.  While blast was running, I did a lockdump and the 
mmaped files had EX locks on them:

G:  s:EX n:2/5497229 f:q t:EX d:EX/0 l:0 a:0 r:3
  I: n:1055314/88699433 t:8 f:0x10 d:0x00000000 s:55237024/55237024

where inode 88699433 is one of the mapped files:

[root at crick blast]# ls -li /databases/mol/blast/db_current/nr.01.pin
88699433 -rw-r--r-- 1 rpcuser sacs 55237024 Jan 17 02:53 
/databases/mol/blast/db_current/nr.01.pin

so that explains the behavior.  What I don't understand is why they had 
EX locks.  I did an strace of the blast, and what I see when the files 
are mmaped is something like:

stat("/databases/mol/blast/db/nr.01.pin", {st_mode=S_IFREG|0644, 
st_size=55237024, ...}) = 0
open("/databases/mol/blast/db/nr.01.pin", O_RDONLY) = 8
mmap(NULL, 55237024, PROT_READ, MAP_SHARED, 8, 0) = 0x2b9ec1a14000

Where /databases/mol/blast is the gfs2 filesystem.  So, the files are 
not opened read/write, and the mmap'ed segment is not read/write.  It's 
not clear why gfs2 would create an exclusive glock for this file?  Does 
this make any sense to you?

-- scooter

On 01/16/2011 07:32 AM, Steven Whitehouse wrote:
> Hi,
>
> On Sat, 2011-01-15 at 16:46 -0800, Scooter Morris wrote:
>> We have a RedHat cluster (5.5 currently) with 3 nodes, and are sharing a
>> number of gfs2 filesystems across all nodes.  One of the applications we
>> run is a standard bioinformatics application called BLAST that searches
>> large indexed files to find similar dna (or protein) sequences.  BLAST
>> will typically mmap a fair amount of data into memory from the index
>> files.  Normally, this significantly speeds up subsequent executions of
>> BLAST.  This doesn't appear to work on gfs2, however, when I involve
>> other nodes.  For example, if I run blast three times on a single node,
>> the first execution is very slow, but subsequent executions are
>> significantly quicker.  If I then run it on another node in the cluster
>> (accessing the same data files over gfs2), the first execution is slow,
>> and subsequent executions are quicker.  This makes sense.  The problem
>> is that when I run it on multiple nodes, the speeds of subsequent runs
>> on the same node are no quicker.  It almost seems as if gfs2 is flushing
>> the in-memory copy (which is read only) immediately when the file is
>> accessed on another node.  Is this the case?  If so, is there a reason
>> for this, or is it a bug?  If it's a known bug, is there a workaround?
>>
>> Any help would be appreciated!  This is a critical application for us.
>>
>> Thanks in advance,
>>
>> -- scooter
>>
> Are you sure that the noatime mount option has been used? I can't figure
> out why that shouldn't work if the BLAST processes are really only
> reading the files and not writing to them.
>
> GFS2 is able to tell the difference between read and write accesses to
> shared, writable mmap()ed files (unlike GFS which has to assume that all
> accesses are write accesses). Some early versions of GFS2 did that too,
> but anything recent (has ->page_mkwrite() in the source) and certainly
> 5.5 does, should be ok.
>
> You can use the glock dump to see what mode the glock associated with
> the mmap()ed inode is in. With RHEL6/Fedora/upstream you can use the
> tracepoints to watch the state dynamically during the operations. I'm
> afraid that isn't available on RHEL5. All you need to know is the inode
> number of the file in question and then look for a type 2 glock with the
> same number.
>
> Let us know if that helps narrow down the issue. BLAST is something that
> I'd like to see running well on GFS2,
>
> Steve.
>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




More information about the Linux-cluster mailing list