[Linux-cluster] GFS2 - F_SETLK fails with "ENOSYS" after umount + mount

Wed Jan 30 11:31:22 UTC 2013

Hi,

I'm setting up a two-node cluster sharing a single GFS2 filesystem
backed by a dual-primary DRBD-device (DRBD on top of LVM, so no CLVM
involved).

I am experiencing more or less the same as the OP in this thread:
http://www.redhat.com/archives/linux-cluster/2010-July/msg00136.html

I have an activemq-5.6.0 instance on each server that tries to lock a
file on the GFS2-filesystem (using ).  

When I start the cluster, everything works as expected. The first
activemq instance that starts up acquires the lock, the lock is released
when that instance exits, and the second instance then takes the lock.

The problem shows up when I unmount and subsequently mount the GFS2
filesystem again on one of the nodes, or reboot one of the nodes (after
having started at least one activemq instance).
Then I start seeing statements like this in the activemq log files:

Database /srv/activemq/queue#3a#2f#2fstat.#3e/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: Function not implemented | org.apache.activemq.store.kahadb.MessageDatabase

strace -f while that message is logged gives the following:

[pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
[pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
[pid  3549] open("/srv/activemq/queue#3a#2f#2fstat.#3e/lock", O_RDWR|O_CREAT, 0666) = 133
[pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid  3549] fcntl(133, F_GETFD)         = 0
[pid  3549] fcntl(133, F_SETFD, FD_CLOEXEC) = 0
[pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid  3549] fcntl(133, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = -1 ENOSYS (Function not implemented)
[pid  3549] dup2(138, 133)              = 133
[pid  3549] close(133)

As you can see, the "Function not implemented" originates from the
F_SETLK fcntl that the JVM does.
The only way to recover from this state seems to be to unmount the
GFS2 filesystem on both nodes and then mount it again on both
nodes.

I've tried to isolate this by using a simpler testcase than starting two
activemq instances. I ended up using the java sample from
http://www.javabeat.net/2007/10/locking-files-using-java/ . 

I haven't managed to get the system into a state where F_SETLK returns
"Function not implemented" by using only the above FileLockTest class (I
need activemq in order to trigger the situation), but once the system is
in that state, I can run FileLockTest, and it will print out the
following stacktrace.

Exception in thread "main" java.io.IOException: Function not implemented
        at sun.nio.ch.FileChannelImpl.lock0(Native Method)
        at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:871)
        at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
        at FileLockTest.main(FileLockTest.java:15)

If I run this on the other server (where the GFS2 fs was not unmounted
and mounted again), it works correctly. 
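
For anyone who wants to check a node without starting activemq, a small
probe along the following lines should be enough. This is only a sketch
and not the javabeat sample itself: the class name LockProbe and the
method probe() are made up for illustration, and the default path "lock"
is a placeholder. It asks for an exclusive, non-blocking lock on the
first byte of the file (the same start=0, len=1 range as in the strace
above), which on Linux should end up as an F_SETLK fcntl.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of activemq or the javabeat sample.
public class LockProbe {

    // Tries to take an exclusive, non-blocking lock on byte 0 of the file
    // and releases it again. Returns "locked", "held elsewhere", or
    // "error: <message>" if the lock call itself fails.
    static String probe(Path file) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // position 0, size 1, shared = false (exclusive lock)
            FileLock lock = ch.tryLock(0L, 1L, false);
            if (lock == null) {
                // Another process holds an overlapping lock.
                return "held elsewhere";
            }
            lock.release();
            return "locked";
        } catch (IOException e) {
            // An ENOSYS from the kernel would be expected to surface here
            // as "Function not implemented".
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder default; pass the real lock file path as an argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "lock");
        System.out.println(file + ": " + probe(file));
    }
}
```

Running it with the path of a file on the GFS2 mount as the only
argument, on each node in turn, should show which node's mount is
affected.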

Any ideas as to what is happening, and why?

BR
Kristian Sørensen