[Linux-cluster] DLM Problem

isplist at logicore.net isplist at logicore.net
Wed Jan 30 16:21:40 UTC 2008


So, I at least now know for sure that this is a locking issue. 

> That's not showing available, that's showing it already in use.

I also used telnet to connect to a few of the machines and was able to get in. 
I don't know enough about what I'm seeing in the netstat output, but it almost 
looks like redirection to 6809?
It's confusing as heck to me, since I didn't change anything except the 
storage, then fired it all back up again. 

> 6809 on the other end is very suspicious, like there is maybe some
> confusion about cman & dlm ports going on. Check cluster.conf and your
> startup scripts for port changing things. It's very unusual. Also check
> that all the nodes are using the same configuration.

I checked the cluster.conf file and don't see anything obvious, so I changed 
the version number and ran an update to all nodes just to be safe. I'm 
rebooting the nodes now, one at a time.
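
To double-check that every node really ended up with the same configuration, the config_version attribute on the top-level cluster element of each node's cluster.conf can be compared. A minimal sketch, assuming the files have already been copied somewhere and read into strings; the helper names and the version numbers in the sample are made up for illustration:

```python
import xml.etree.ElementTree as ET

def config_version(cluster_conf_xml):
    """Return the config_version attribute of the root <cluster> element."""
    root = ET.fromstring(cluster_conf_xml)
    return int(root.attrib["config_version"])

def mismatched_nodes(confs):
    """confs maps node name -> cluster.conf text.
    Returns the nodes whose version differs from the highest one seen."""
    versions = {node: config_version(text) for node, text in confs.items()}
    newest = max(versions.values())
    return sorted(node for node, v in versions.items() if v != newest)

# Hypothetical sample: one node still carries the older version.
sample = {
    "compdev": '<cluster name="compweb" config_version="12"/>',
    "cweb92": '<cluster name="compweb" config_version="12"/>',
    "img63": '<cluster name="compweb" config_version="11"/>',
}
print(mismatched_nodes(sample))
```

An empty list would mean all copies carry the same version; it says nothing about whether the rest of the file contents match.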

In my workstation console log, I see:

Jan 29 21:22:40 compdev kernel: GFS: fsid=compweb:web.0: jid=0: Trying to acquire journal lock...
Jan 29 21:22:40 compdev kernel: GFS: fsid=compweb:web.0: jid=0: Looking at journal...
Jan 29 21:22:40 compdev kernel: GFS: fsid=compweb:web.0: jid=0: Done
Jan 30 08:50:36 compdev kernel: GFS: fsid=compweb:web.0: jid=3: Trying to acquire journal lock...
Jan 30 08:50:36 compdev kernel: GFS: fsid=compweb:web.0: jid=3: Busy
Jan 30 08:57:38 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Trying to acquire journal lock...
Jan 30 08:57:38 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Busy
Jan 30 09:05:29 compdev kernel: GFS: fsid=compweb:web.0: jid=2: Trying to acquire journal lock...
Jan 30 09:05:29 compdev kernel: GFS: fsid=compweb:web.0: jid=2: Busy
Jan 30 10:06:10 compdev kernel: CMAN: node cweb92 has been removed from the cluster : Missed too many heartbeats
Jan 30 10:08:14 compdev kernel: CMAN: node cweb92 rejoining
Jan 30 10:08:17 compdev kernel: dlm: could not bind to local address for connect: -98
Jan 30 10:10:26 compdev kernel: CMAN: node img63 has been removed from the cluster : Missed too many heartbeats
Jan 30 10:12:53 compdev kernel: CMAN: node img63 rejoining
Jan 30 10:12:57 compdev kernel: dlm: could not bind to local address for connect: -98
Jan 30 10:17:43 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Trying to acquire journal lock...
Jan 30 10:17:43 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Looking at journal...
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Acquiring the transaction lock...
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Replaying journal...
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Replayed 0 of 22 blocks
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: replays = 0, skips = 0, sames = 22
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Journal replayed in 1s
Jan 30 10:19:11 compdev kernel: GFS: fsid=compweb:web.0: jid=1: Done
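
The -98 in the dlm lines is a negated kernel errno. On Linux, 98 is EADDRINUSE ("Address already in use"), which fits the bind failure. A minimal sketch for decoding such a code, assuming it is run on a Linux box since errno numbering is platform-specific; the helper name decode_kernel_errno is only for illustration:

```python
import errno
import os

def decode_kernel_errno(code):
    """Return (symbolic name, message) for a kernel-style negative errno."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# The value logged by dlm above.
name, message = decode_kernel_errno(-98)
print(name, "-", message)
```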

I looked at some of the other nodes and they all show similar things. This 
seems to show that port 21064 is available on all nodes. 

#ssh 192.168.1.40 netstat -anp | grep 21064
tcp 0 0 192.168.1.40:21064 0.0.0.0:*          LISTEN      -
tcp 0 0 192.168.1.40:6809  192.168.1.58:21064 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.92:33123 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.62:32779 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.63:6809  ESTABLISHED -

#ssh 192.168.1.62 netstat -anp | grep 21064
tcp 0 0 192.168.1.62:21064 0.0.0.0:*          LISTEN      -
tcp 0 0 192.168.1.62:21064 192.168.1.63:32774 ESTABLISHED -
tcp 0 0 192.168.1.62:6809  192.168.1.58:21064 ESTABLISHED -
tcp 0 0 192.168.1.62:32773 192.168.1.92:21064 ESTABLISHED -
tcp 0 0 192.168.1.62:32780 192.168.1.63:21064 ESTABLISHED -
tcp 0 0 192.168.1.62:21064 192.168.1.58:6809  ESTABLISHED -
tcp 0 0 192.168.1.62:32779 192.168.1.40:21064 ESTABLISHED -

#ssh 192.168.1.63 netstat -anp | grep 21064
tcp 0 0 192.168.1.63:21064 0.0.0.0:*          LISTEN      -
tcp 0 0 192.168.1.63:6809  192.168.1.40:21064 ESTABLISHED -
tcp 0 0 192.168.1.63:21064 192.168.1.62:32780 ESTABLISHED -
tcp 0 0 192.168.1.63:21064 192.168.1.92:33157 ESTABLISHED -
tcp 0 0 192.168.1.63:32774 192.168.1.62:21064 ESTABLISHED -
tcp 0 0 192.168.1.63:32780 192.168.1.58:21064 ESTABLISHED -

#ssh 192.168.1.92 netstat -anp | grep 21064
tcp 0 0 192.168.1.92:21064 0.0.0.0:*          LISTEN      -
tcp 0 0 192.168.1.92:6809  192.168.1.58:21064 ESTABLISHED -
tcp 0 0 192.168.1.92:21064 192.168.1.62:32773 ESTABLISHED -
tcp 0 0 192.168.1.92:33157 192.168.1.63:21064 ESTABLISHED -
tcp 0 0 192.168.1.92:33123 192.168.1.40:21064 ESTABLISHED -
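
To pick the odd connections out of output like this without eyeballing it, the lines can be filtered for established connections to port 21064 whose local port is not in the usual ephemeral range. A rough sketch, assuming the column layout shown above and a lower ephemeral bound of 32768 (the common Linux default; the actual range on these nodes may differ). The function name is made up:

```python
def odd_source_ports(netstat_text, dlm_port=21064, ephemeral_floor=32768):
    """Return (local, remote) pairs for established connections to dlm_port
    whose local port is below the assumed ephemeral range."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect: proto recv-q send-q local remote state pid/program
        if len(fields) < 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        local, remote = fields[3], fields[4]
        local_port = int(local.rsplit(":", 1)[1])
        remote_port = int(remote.rsplit(":", 1)[1])
        if remote_port == dlm_port and local_port < ephemeral_floor:
            hits.append((local, remote))
    return hits

# Output from 192.168.1.40, copied from above.
sample = """\
tcp 0 0 192.168.1.40:21064 0.0.0.0:*          LISTEN      -
tcp 0 0 192.168.1.40:6809  192.168.1.58:21064 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.92:33123 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.62:32779 ESTABLISHED -
tcp 0 0 192.168.1.40:21064 192.168.1.63:6809  ESTABLISHED -
"""
print(odd_source_ports(sample))
# -> [('192.168.1.40:6809', '192.168.1.58:21064')]
```

This only flags outbound connections from the node being inspected; the inbound connection from 192.168.1.63:6809 would show up when the same check is run against 192.168.1.63's output.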





