[Linux-cluster] GFS2 - F_SETLK fails with "ENOSYS" after umount + mount

Wed Jan 30 13:17:25 UTC 2013

----- Original Message -----
| Hi,
| 
| I'm setting up a two-node cluster sharing a single GFS2 filesystem
| backed by a dual-primary DRBD-device (DRBD on top of LVM, so no CLVM
| involved).
| 
| I am experiencing more or less the same as the OP in this thread:
| http://www.redhat.com/archives/linux-cluster/2010-July/msg00136.html
| 
| I have an activemq-5.6.0 instance on each server that tries to lock a
| file on the GFS2-filesystem (using ).
| 
| When I start the cluster, everything works as expected. The first
| activemq instance that starts up acquires the lock, the lock is
| released when activemq exits, and the second instance takes the lock.
| 
| The problem shows up when I unmount and subsequently mount the GFS2
| filesystem again on one of the nodes, or reboot one of the nodes
| (after having started at least one activemq instance).
| Then I start seeing statements like this in the activemq log files:
| 
| Database /srv/activemq/queue#3a#2f#2fstat.#3e/lock is locked...
| waiting 10 seconds for the database to be unlocked. Reason:
| java.io.IOException: Function not implemented |
| org.apache.activemq.store.kahadb.MessageDatabase
| 
| strace -f while that message is logged gives the following:
| 
| [pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e",
| {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
| [pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e",
| {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
| [pid  3549] open("/srv/activemq/queue#3a#2f#2fstat.#3e/lock",
| O_RDWR|O_CREAT, 0666) = 133
| [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
| [pid  3549] fcntl(133, F_GETFD)         = 0
| [pid  3549] fcntl(133, F_SETFD, FD_CLOEXEC) = 0
| [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
| [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
| [pid  3549] fcntl(133, F_SETLK, {type=F_WRLCK, whence=SEEK_SET,
| start=0, len=1}) = -1 ENOSYS (Function not implemented)
| [pid  3549] dup2(138, 133)              = 133
| [pid  3549] close(133)
| 
| As you can see, the "Function not implemented" originates from the
| F_SETLK fcntl that the JVM does.
| The only way to recover from this state seems to be to unmount the
| GFS2 filesystem on both nodes, then mount it again on both nodes.
| 
| I've tried to isolate this by using a simpler test case than starting
| two activemq instances. I ended up using the Java sample from
| http://www.javabeat.net/2007/10/locking-files-using-java/ .
| 
| I haven't managed to get the system into a state where F_SETLK
| returns "Function not implemented" by using only the above
| FileLockTest class (I need activemq in order to trigger the
| situation), but when the system is in that state, I can run
| FileLockTest, and it will print out the following stack trace:
| 
| Exception in thread "main" java.io.IOException: Function not implemented
|         at sun.nio.ch.FileChannelImpl.lock0(Native Method)
|         at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:871)
|         at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
|         at FileLockTest.main(FileLockTest.java:15)
| 
| 
| If I run this on the other server (where the GFS2 filesystem was not
| unmounted and mounted again), it works correctly.
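
The failing request in the strace output can also be issued without the
JVM, which may help separate the filesystem behaviour from anything
activemq or Java does. Below is a minimal sketch in Python (chosen only
for brevity; it is not the FileLockTest sample). The function name
try_posix_lock is hypothetical. It asks for a non-blocking write lock on
the first byte of a file, as in the trace, and returns the errno name if
the kernel refuses.

```python
import errno
import fcntl
import os


def try_posix_lock(path):
    """Try a non-blocking POSIX write lock on byte 0 of `path`.

    Returns None if the lock was taken (and released again), otherwise
    the symbolic errno name, e.g. "ENOSYS" or "EAGAIN".
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # With LOCK_NB, lockf() issues an F_SETLK fcntl; LOCK_EX maps to
        # F_WRLCK. Arguments after the flags: len=1, start=0, SEEK_SET.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
        # Release the same one-byte range again.
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

Calling try_posix_lock() on a scratch file on the GFS2 mount of each
node should return None on the healthy node and, if the problem is
present, "ENOSYS" on the node that was remounted.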
| 
| Any ideas as to what is happening, and why?
| 
| BR
| Kristian Sørensen

Hi Kristian,

After doing some simple checks (which shouldn't be your problem), GFS2
passes all POSIX lock requests down to the DLM for further processing.
I'm not sure what the DLM does with them from there, but I believe the
requests are processed by user space, i.e. openais, etc., depending on
what version you're running.  I recommend checking "dmesg" to see if
there are any pertinent errors logged there. You could also check
/var/log/messages to see if user space logged any complaints.  Also,
you might want to run this command to check for pertinent errors:

group_tool dump gfs

(Now, if it were an flock rather than a POSIX lock, I could help you,
because flocks are handled by GFS2 and not just passed on to the DLM.)
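
Since the two lock types take different paths, trying both on the same
file may show whether only the DLM-backed POSIX path is affected. This
is a rough sketch in Python, not a diagnostic tool that ships with GFS2.
The names probe_locks and _attempt are hypothetical.

```python
import errno
import fcntl
import os


def _attempt(lock, unlock):
    """Run lock() then unlock(); return "ok" or the errno name on failure."""
    try:
        lock()
        unlock()
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


def probe_locks(path):
    """Try a POSIX lock and then an flock on `path`, one after the other."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        return {
            # POSIX (fcntl) lock: the kind described above as passed to the DLM.
            "posix": _attempt(
                lambda: fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB),
                lambda: fcntl.lockf(fd, fcntl.LOCK_UN),
            ),
            # BSD flock(2): the kind described above as handled by GFS2 itself.
            "flock": _attempt(
                lambda: fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB),
                lambda: fcntl.flock(fd, fcntl.LOCK_UN),
            ),
        }
    finally:
        os.close(fd)
```

A result where "posix" is "ENOSYS" and "flock" is "ok" would point at
the POSIX lock path rather than at the mount as a whole.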

Regards,

Bob Peterson
Red Hat File Systems