[Linux-cluster] DLM/CLVM problem

Wed Sep 22 03:47:42 UTC 2004

On Tue, Sep 21, 2004 at 08:39:21PM +0200, Lazar Obradovic wrote:
> Hi, 
> 
> I more often than not have a problem when starting clvmd. It starts
> normally, but /proc/cluster/services says: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [5 7 6 4 2 3 1]
> 
> DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
> []
> 
> 
> while other nodes report: 
> 
> # cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [4 2 5 3 6 7]
> 
> DLM Lock Space:  "clvmd"                             1   3 update    U-4,1,7
> [4 2 5 3 6 7]
> 
> vgchange will hang afterwards, and only a reboot will (eventually) fix the
> problem. Other nodes are working just fine in the meantime... 

> What do "code" flags *exactly* mean?

for update events beginning with "U-"
4 = ue_state = UEST_JSTART_SERVICEWAIT
1 = ue_flags = UEFL_ALLOW_STARTDONE
7 = ue_nodeid = nodeid of node joining or leaving the sg

SM is waiting for the dlm service to complete recovery.  The dlm on nodes
[4 2 5 3 6 7] is still in the process of recovery due to node 7 joining
the lockspace.  If it stays this way for long, it probably means that dlm
recovery is hung for some reason.  dmesg or /proc/cluster/dlm_debug should
show roughly how far the dlm recovery got.
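
To spot a service that is sitting in a non-"run" state, the listing can be
parsed with a few lines of script.  This is only a rough sketch: the row
layout is assumed from the output quoted in this thread, and the names
parse_services and not_running are made up for the illustration, they are
not part of any cluster tool.

```python
import re
from typing import List, NamedTuple


class Service(NamedTuple):
    kind: str           # e.g. "Fence Domain" or "DLM Lock Space"
    name: str
    gid: int
    lid: int
    state: str          # e.g. "run", "join", "update"
    code: str           # "-" or an event code such as "U-4,1,7"
    members: List[int]  # node ids from the bracketed line that follows


# Assumed row layout, taken only from the listing quoted in this thread.
ROW = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]*)"\s+(?P<gid>\d+)\s+(?P<lid>\d+)'
    r'\s+(?P<state>\S+)\s+(?P<code>\S+)\s*$'
)
MEMBERS = re.compile(r'^\[(?P<ids>[\d\s]*)\]\s*$')


def parse_services(text: str) -> List[Service]:
    """Parse the text of a services listing into Service records."""
    services: List[Service] = []
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            services.append(Service(row["kind"], row["name"], int(row["gid"]),
                                    int(row["lid"]), row["state"], row["code"], []))
            continue
        members = MEMBERS.match(line)
        if members and services:
            # The member list belongs to the most recent service row.
            services[-1].members.extend(int(n) for n in members["ids"].split())
    return services


def not_running(services: List[Service]) -> List[Service]:
    """Return the services whose state is anything other than "run"."""
    return [s for s in services if s.state != "run"]


# Listing from the joining node, as quoted above.
SAMPLE = """\
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[5 7 6 4 2 3 1]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,7
[]
"""

for svc in not_running(parse_services(SAMPLE)):
    print(svc.kind, svc.name, svc.state, svc.code)
```

Running this a few times some seconds apart and seeing the same service in
the same state each time would be a hint that recovery is not progressing;
the dlm debug output is still where to look for the reason.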

for service events beginning with "S-"
1 = se_state = SEST_JOIN_BEGIN
80 = se_flags = SEFL_DELAY
7 = se_reply_count = number of replies received

SM will not permit this node to join the lockspace because the lockspace
in question is still doing recovery.  Once recovery completes, this node
will go ahead and join.
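
The two layouts described above can be split into named fields with a small
helper.  This is a minimal sketch, not anything shipped with the cluster
tools: decode_code is a hypothetical name, and the values are kept as strings
because nothing here says which base they are printed in.

```python
from typing import Dict

# Field names for each event type, in the order given above.
FIELD_NAMES = {
    "U": ("ue_state", "ue_flags", "ue_nodeid"),
    "S": ("se_state", "se_flags", "se_reply_count"),
}


def decode_code(code: str) -> Dict[str, str]:
    """Split a Code column value such as "U-4,1,7" into named fields."""
    if code == "-":
        return {}  # no event in progress for this service
    kind, sep, rest = code.partition("-")
    if not sep or kind not in FIELD_NAMES:
        raise ValueError("unrecognised code: %r" % code)
    values = rest.split(",")
    names = FIELD_NAMES[kind]
    if len(values) != len(names):
        raise ValueError("expected %d fields in %r" % (len(names), code))
    return dict(zip(names, values))


print(decode_code("U-4,1,7"))
print(decode_code("S-1,80,7"))
```

Mapping the numeric state and flag values back to names such as
UEST_JSTART_SERVICEWAIT or SEFL_DELAY would need the full tables from the SM
source, which are not reproduced here.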

-- 
Dave Teigland  <teigland at redhat.com>