[Linux-cluster] Lock contention on BerkleyDB lock file

Elias, Michael EliasM at dnb.com
Tue Mar 17 05:21:56 UTC 2009


Hi all, 

 

We are trying to investigate an incident where rogue multiple instances
of Jboss caused possible lock contention on a BerkleyDB lock file. 

This is a 5 node cluster, 3 nodes using a BerkelyDB on a GFS partition.

Basic sequence of events;

Node1 appears to be the lock master for the BerkleyDB lock file

 

Node2 starts 4 addition JBoss servers (under investigation to how/why
they started) that are in an uninterruptible sleep state. Attempted
killing the process with kill -9, but not killable as is usual with the
dreaded "D" state. 

 

Node 1 we stop the JBoss running on that server which seem to release
any contention on Node 2, we are able to kill all the JBoss process. We
then restart the JBoss server on Node 1.

 

Node 2 we removed the node from the cluster to release any dlm locks it
is holding. Stopping the following in this order rgmanger, gfs, clvmd,
fenced, cman. 

 

Node 2, we restarted the cluster, cman, fenced, clvmd, gfs.. Once
starting GFS we start receiving dlm_lock errors  and would not mount. At
that point dlm seemed to make gfs unavailable on nodes 1,2,and 3.

At that point we recycled nodes 1,2,and3 and they came up clean.

 

So questions;

1.	In the log info below "ex plock -11"; I believe ex=exclusive
plock=posixlock but what does "-11" mean?
2.	Why would stopping JBoss on node1 release the contention on
node2?
3.	Why would node 2 complain about dlm_lock when restarting all the
cluster services and lockup the cluster
4.	And not necessarily for this group, is it possible for JBoss to
start random servers due to some sort of lock contention.

 

 

Thanks for any help that is offered.

Michael

 

 

 

The log info below is from node2 and is before any corrective action was
taken. It is the state when the additional JBoss processes were running.

 

/proc/cluster/lock_dlm/debug

 

/proc/cluster/lock_dlm/debug:14043 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14043 ex plock -11

/proc/cluster/lock_dlm/debug:14037 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14037 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14037 ex plock -11

/proc/cluster/lock_dlm/debug:14036 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14036 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14036 ex plock -11

/proc/cluster/lock_dlm/debug:14039 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14039 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14039 ex plock -11

/proc/cluster/lock_dlm/debug:22414 en punlock 7,f1021d

/proc/cluster/lock_dlm/debug:14038 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14038 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14038 ex plock -11

/proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14041 ex plock -11

/proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14041 ex plock -11

/proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14040 ex plock -11

/proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14040 ex plock -11

/proc/cluster/lock_dlm/debug:14037 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14037 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14037 ex plock -11

/proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14040 ex plock -11

/proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14041 ex plock -11

/proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14040 ex plock -11

/proc/cluster/lock_dlm/debug:14038 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14038 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14038 ex plock -11

/proc/cluster/lock_dlm/debug:14044 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14044 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14044 ex plock -11

/proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee

/proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1,
pid 1u

/proc/cluster/lock_dlm/debug:14040 ex plock -11

 

That inode hade the following in DLM_locks

 

Resource 00000100b833c508 (parent 0000000000000000). Name (len=24) "
2         25101ee"  

Local Copy, Master is node 1

Granted Queue

154f0156 PR 14040 Master:     8fe80123

Conversion Queue

Waiting Queue

 

 

Resource 00000101da9a9598 (parent 0000000000000000). Name (len=24) "
5         25101ee"  

Master Copy

Granted Queue

1229013e PR 14040

098f0154 PR 19040 Remote:   5 d1fa036e

190c02f1 PR 6587 Remote:   1 12cb023f

8612028c PR 15187 Remote:   3 00160179

Conversion Queue

Waiting Queue

 

Resource 00000101c5d28908 (parent 0000000000000000). Name (len=24) "
11         25101ee"  

Master Copy

Granted Queue

28630181 NL 24849 Remote:   1 8ca6012e

277703b2 NL 32360

86f603bc NL 15187 Remote:   3 0011006d

Conversion Queue

Waiting Queue

 

Resource 00000100eab6c9d8 (parent 0000000000000000). Name (len=24) "
7         25101ee"  

Master Copy

LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Granted Queue

a7dd00b2 EX 21994 0-0 Remote:   1 8858013a

95a603ef NL 16953

86fd00a7 NL 15187 Remote:   3 000e005e

748702cf NL 8385 Remote:   1 a25e0322

Conversion Queue

Waiting Queue

 

The inode "25101ee"  points to the following file:

38863342 -rw-r--r--  1 nobody nobody    0 Jan 18 13:19 je.lck

 

nobody      13892 34.1 64.1 8238260 5138044 ?   Sl   Mar01 6328:11
/usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server
-Xms6144m -Xmx6144m -XX:MaxPermSize=256m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
-Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed
-classpath
/xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool
s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default
-Djboss.partition.name=PROD1-yyy1

nobody      21803  0.0 53.8 8223868 4311272 ?   D    17:59   0:00
/usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server
-Xms6144m -Xmx6144m -XX:MaxPermSize=256m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
-Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed
-classpath
/xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool
s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default
-Djboss.partition.name=PROD1-yyy1

nobody      21808  0.0 53.8 8223868 4311496 ?   D    17:59   0:00
/usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server
-Xms6144m -Xmx6144m -XX:MaxPermSize=256m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
-Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed
-classpath
/xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool
s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default
-Djboss.partition.name=PROD1-yyy1

nobody      21809  0.0 53.8 8223868 4311524 ?   D    17:59   0:00
/usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server
-Xms6144m -Xmx6144m -XX:MaxPermSize=256m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
-Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed
-classpath
/xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool
s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default
-Djboss.partition.name=PROD1-yyy1

nobody      22414  0.0 55.2 8232092 4419488 ?   D    18:03   0:00
/usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server
-Xms6144m -Xmx6144m -XX:MaxPermSize=256m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
-Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed
-classpath
/xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool
s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default
-Djboss.partition.name=PROD1-yyy1

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090317/2139cae9/attachment.htm>


More information about the Linux-cluster mailing list