[Linux-cluster] node joining still not working 100% in 3node cluster

Wed Aug 13 18:17:58 UTC 2008

I thought I had this worked out, but things are still not working 100%.

Setup: 3-node GFS cluster; each node has 1 vote and the quorum disk has
2 votes. The cluster is up and running with no problems. I then reboot
one node. For troubleshooting purposes, I removed gfs from the default
startup so I can start it manually (cman and qdiskd are still started
automatically).
The nodes are 2, 3 and 4. Node 2 is the one being restarted.

The logs all show the node leaving successfully. Quorum is 3, expected
votes 5, total votes 4 (once the node has shut down).
When node 2 restarts, cman and qdiskd start up. At this point, cluster
services show everything back to normal. Output from cman_tool status
on all 3 nodes is the same, with no errors (output abbreviated here).

# cman_tool status
Config Version: 5
Nodes: 3
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7
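
As a sanity check on the vote counts above, here is a minimal Python
sketch. It assumes the usual simple-majority rule (quorum is
expected votes divided by 2, rounded down, plus 1). The function and
variable names are made up for illustration; only the vote counts come
from the setup described in this post.

```python
# Hypothetical helper names; vote counts are taken from the post.
NODE_VOTES = {2: 1, 3: 1, 4: 1}   # node id -> votes
QDISK_VOTES = 2                   # votes contributed by the quorum disk


def quorum(expected_votes: int) -> int:
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1


def total_votes(online_nodes, qdisk_online=True) -> int:
    """Sum the votes of the online nodes, plus the quorum disk if present."""
    votes = sum(NODE_VOTES[n] for n in online_nodes)
    return votes + (QDISK_VOTES if qdisk_online else 0)


expected = sum(NODE_VOTES.values()) + QDISK_VOTES   # 5
needed = quorum(expected)                           # 3

# Node 2 shut down: nodes 3 and 4 plus the quorum disk give 4 votes.
while_node2_down = total_votes([3, 4])
# Node 2 back: all three nodes plus the quorum disk give 5 votes.
after_rejoin = total_votes([2, 3, 4])

print(expected, needed, while_node2_down, after_rejoin)
print("quorate while node 2 is down:", while_node2_down >= needed)
```

Under that assumption the cluster stays quorate the whole time, which
matches the cman_tool status output, so the hang does not look like a
quorum problem.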

However, when I run service gfs start (or try to mount my first gfs
volume), it just hangs. My logs on node 2 show the following:
Aug 13 19:51:42 blade2 gfs_controld[2825]: retrieve_plocks: ckpt open error 12 cache1
Aug 13 19:51:42 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007 14:43:37) installed
Aug 13 19:51:42 blade2 kernel: Trying to join cluster "lock_dlm", "jemdevcluster:cache1"
Aug 13 19:51:42 blade2 kernel: dlm: Using TCP for communications
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 3
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 4
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4

At this point, mount.gfs just hangs. Restarting node 2 causes the same
thing to happen every time, and I am not able to get the 2 gfs
volumes mounted. Nodes 3 and 4 can still access the filesystem, however.

After a second reboot, my logs on node 2 show:
Aug 13 20:13:08 blade2 qdiskd[2873]: <info> Node 3 is the master
Aug 13 20:13:09 blade2 gfs_controld[2834]: retrieve_plocks: ckpt open error 12 cache1
Aug 13 20:13:09 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007 14:43:37) installed
Aug 13 20:13:09 blade2 kernel: Trying to join cluster "lock_dlm", "jemdevcluster:cache1"
Aug 13 20:13:09 blade2 kernel: dlm: Using TCP for communications
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 3
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 4

Nodes 3 and 4 both show the following in their logs:
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] Members Joined:
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 13 20:14:17 blade4 openais[2554]: [SYNC ] This node is within the primary component and will provide service.
Aug 13 20:14:17 blade4 openais[2554]: [TOTEM] entering OPERATIONAL state.
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message 192.168.70.102
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message 192.168.70.103
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message 192.168.70.104
Aug 13 20:14:17 blade4 openais[2554]: [CPG  ] got joinlist message from node 4
Aug 13 20:14:17 blade4 openais[2554]: [CPG  ] got joinlist message from node 3
Aug 13 20:14:33 blade4 kernel: dlm: connecting to 2

Surely node 2 should connect to 3, get a connection from 3, and then
connect to 4 and get a connection from 4?
Could this possibly be a GFS bug?
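
To make the connection pattern easier to compare between the two
attempts, here is a small Python sketch that counts the dlm
"connecting to" and "got connection from" lines per peer node. The
helper name summarize_dlm is hypothetical, and the sample lines are
copied from the node 2 logs above. It only tallies what was logged and
does not diagnose anything by itself.

```python
import re
from collections import defaultdict

# Matches the two dlm kernel messages seen in the logs above.
DLM_RE = re.compile(r"dlm: (connecting to|got connection from) (\d+)")


def summarize_dlm(lines):
    """Count outgoing and incoming dlm connection messages per peer node id."""
    summary = defaultdict(lambda: {"outgoing": 0, "incoming": 0})
    for line in lines:
        match = DLM_RE.search(line)
        if match is None:
            continue  # not a dlm connection line
        direction = "outgoing" if match.group(1) == "connecting to" else "incoming"
        summary[int(match.group(2))][direction] += 1
    return dict(summary)


first_attempt = [
    "Aug 13 19:51:42 blade2 kernel: dlm: Using TCP for communications",
    "Aug 13 19:51:42 blade2 kernel: dlm: got connection from 3",
    "Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4",
    "Aug 13 19:51:42 blade2 kernel: dlm: got connection from 4",
    "Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4",
]

second_attempt = [
    "Aug 13 20:13:09 blade2 kernel: dlm: Using TCP for communications",
    "Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3",
    "Aug 13 20:13:09 blade2 kernel: dlm: got connection from 3",
    "Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3",
    "Aug 13 20:13:09 blade2 kernel: dlm: got connection from 4",
]

for name, lines in (("first", first_attempt), ("second", second_attempt)):
    print(name, summarize_dlm(lines))
```

In the first attempt node 2 logs two outgoing connections to node 4
and none to node 3; in the second it logs two outgoing connections to
node 3 and none to node 4. In both cases it receives one connection
from each peer.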

Brett