[Linux-cluster] How to decrease DLM priority during log scanning

Fajar A. Nugraha fajar at telkom.net.id
Wed Jun 29 06:51:19 UTC 2005


Hi,

this is a follow-up to my previous mail, "Node fenced when mounting 
gfs", in which mounting a particular GFS volume containing lots of files 
can cause a node (node2) to appear to hang, and thus be fenced by the 
other node (node1).

Searching the archive I found a relevant thread, "node kicked out of 
cluster", in which Patrick Caulfield comments:
"DLM can hog the CPU when recovering huge numbers of locks, so we are 
looking into placing some strategic "schedule()" calls in the recovery 
process."

This seems to be the case in my problem, since top shows nearly 100% 
system time. BTW, my system is a dual Xeon box.

In another mail thread, "Configuring CMAN timer/timeout values", I found 
a possible workaround: modifying the values under /proc/cluster/config/cman/. 
Increasing deadnode_timeout (I tried 2100) prevents node2 from being 
fenced, but now node1's performance drops significantly even when its 
CPU load is very low (e.g. other servers mounting NFS from node2 keep 
getting NFS timeout errors). Am I right to assume that GFS requires 
locking on both nodes during writes? If so, this makes sense, since 
node2 is too busy "scanning log elements" to respond to anything. After 
more than 30 minutes node2 still hadn't finished "scanning log elements", 
so I changed /proc/cluster/config/cman/deadnode_timeout on node1 back to 
its default value (21), and node2 got fenced automatically.
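For reference, the workaround I tried amounts to something like the 
following (run as root; the default of 21 is what I see on my release, 
yours may differ):

```shell
# Check the current dead-node timeout (seconds); 21 is the default here.
cat /proc/cluster/config/cman/deadnode_timeout

# Raise it so a node busy in DLM recovery is not declared dead and fenced.
echo 2100 > /proc/cluster/config/cman/deadnode_timeout

# ... and later, restore the default:
echo 21 > /proc/cluster/config/cman/deadnode_timeout
```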

So the questions are:
-    is it normal for a "scanning log elements" process to take over 30 
minutes?
-    is there a way to make "scanning log elements" run at a lower 
priority (e.g. lowering the priority of DLM while it is recovering locks)?
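Until DLM itself yields the CPU, I wonder whether renicing the recovery 
kernel thread would help as a band-aid. Something like the sketch below 
(untested; the thread name "dlm_recoverd" is an assumption on my part, 
check ps output on your box):

```shell
# Untested sketch: find the DLM recovery kernel thread, if any, and give
# it the lowest scheduling priority. Needs root to renice.
pid=$(ps -eo pid,comm | awk '$2 == "dlm_recoverd" {print $1; exit}')
if [ -n "$pid" ]; then
    renice 19 -p "$pid"   # 19 = lowest (nicest) priority
fi
```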

Regards,

Fajar
