[Linux-cluster] GFS + DRBD Problems

Tue Mar 4 22:14:09 UTC 2008

As I thought, the problem I'm seeing is indeed rather multi-part. The 
first part is now resolved - large time-skips due to the system clock 
being out of date until ntpd syncs it up. It seems that large time jumps 
made dlm choke.

Now for part 2:

The two nodes connect - certainly enough to sync up DRBD. That stage 
goes through fine. They start cman and the other cluster components, but 
it would appear they never actually find each other.

When mounting the shared file system:

Node 1:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs
GFS: fsid=sentinel:root.0: Done

Node 2:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs
GFS: fsid=sentinel:root.0: Done

Unless I'm reading this wrong, they are both trying to use JID 0.
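
For anyone who wants to compare the two logs mechanically, here is a rough Python sketch that pulls out the journal IDs for which each node logged a "Replaying journal..." line. The function name replayed_journals and the regular expression are my own, built only from the line layout shown above. It shows which journals each node replayed; it does not by itself prove which journal each node ended up owning.

```python
import re

# Matches lines such as:
#   GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
JOURNAL_RE = re.compile(r"GFS: fsid=(?P<fsid>\S+): jid=(?P<jid>\d+): (?P<msg>.*)")


def replayed_journals(log_text):
    """Return the set of journal IDs with a 'Replaying journal' line."""
    jids = set()
    for line in log_text.splitlines():
        match = JOURNAL_RE.match(line.strip())
        if match and match.group("msg").startswith("Replaying journal"):
            jids.add(int(match.group("jid")))
    return jids


# Excerpts copied from the two mount logs above.
NODE1_LOG = """\
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
"""

NODE2_LOG = """\
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
"""

if __name__ == "__main__":
    print("node 1 replayed:", sorted(replayed_journals(NODE1_LOG)))
    print("node 2 replayed:", sorted(replayed_journals(NODE2_LOG)))
```

Both nodes report a replay of journal 0 and none of journal 1, which is what prompted the remark above.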

The second node to join generally chokes at some point during boot, 
but only AFTER it has mounted the GFS volume. On the node that did boot, 
cman_tool status says:

# cman_tool status
Version: 6.0.1
Config Version: 20
Cluster Name: sentinel
Cluster Id: 28150
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: sentinel1c
Node ID: 1
Multicast addresses: 239.192.109.100
Node addresses: 10.0.0.1

So the second node never joined.
I know for a fact that the network connection between them is working, 
as they sync DRBD.
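
To make that check repeatable, here is a minimal Python sketch that parses the cman_tool status output above and compares the node count against the two nodes defined in cluster.conf. The names parse_cman_status and all_nodes_joined are made up for illustration; the only thing taken from cman is the "Key: value" layout of the listing above.

```python
def parse_cman_status(text):
    """Turn 'Key: value' lines from `cman_tool status` into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            status[key.strip()] = value.strip()
    return status


def all_nodes_joined(status, configured_nodes):
    """True if the 'Nodes' count reaches the number of configured nodes."""
    return int(status.get("Nodes", "0")) >= configured_nodes


# Excerpt of the output shown above.
SAMPLE = """\
Cluster Name: sentinel
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Flags: 2node
Node name: sentinel1c
"""

if __name__ == "__main__":
    status = parse_cman_status(SAMPLE)
    print(status["Node name"], "sees", status["Nodes"], "node(s)")
    print("all joined:", all_nodes_joined(status, configured_nodes=2))
```

In practice the text would come from running cman_tool status on each node rather than from a pasted sample.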

cluster.conf is here:

<?xml version="1.0"?>
<cluster config_version="20" name="sentinel">
         <cman two_node="1" expected_votes="1"/>
         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
         <clusternodes>
                 <clusternode name="sentinel1c" nodeid="1" votes="1">
                         <com_info>
                                 <rootsource name="drbd"/>
                                 <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                 fstype          = "ext3"
                                                 device          = "/dev/sda2"
                                                 chrootdir       = "/var/comoonics/chroot"
                                 />-->
                                 <syslog name="localhost"/>
                                 <rootvolume     name            = "/dev/drbd1"
                                                 mountopts       = "defaults,noatime,nodiratime,noquota"
                                 />
                                 <eth    name    = "eth0"
                                         ip      = "10.0.0.1"
                                         mac     = "00:0B:DB:92:C5:E1"
                                         mask    = "255.255.255.0"
                                         gateway = ""
                                 />
                                 <fenceackserver user    = "root"
                                                 passwd  = "password"
                                 />
                         </com_info>
                         <fence>
                                 <method name = "1">
                                         <device name = "sentinel1d"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="sentinel2c" nodeid="2" votes="1">
                         <com_info>
                                 <rootsource name="drbd"/>
                                 <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                 fstype          = "ext3"
                                                 device          = "/dev/sda2"
                                                 chrootdir       = "/var/comoonics/chroot"
                                 />-->
                                 <syslog name="localhost"/>
                                 <rootvolume     name            = "/dev/drbd1"
                                                 mountopts       = "defaults,noatime,nodiratime,noquota"
                                 />
                                 <eth    name    = "eth0"
                                         ip      = "10.0.0.2"
                                         mac     = "00:0B:DB:90:4E:1B"
                                         mask    = "255.255.255.0"
                                         gateway = ""
                                 />
                                 <fenceackserver user    = "root"
                                                 passwd  = "password"
                                 />
                         </com_info>
                         <fence>
                                 <method name = "1">
                                         <device name = "sentinel2d"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman/>
         <fencedevices>
                 <fencedevice agent="fence_drac" ipaddr="192.168.254.252" login="root" name="sentinel1d" passwd="password"/>
                 <fencedevice agent="fence_drac" ipaddr="192.168.254.253" login="root" name="sentinel2d" passwd="password"/>
         </fencedevices>
         <rm>
                 <failoverdomains/>
                 <resources/>
         </rm>
</cluster>
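
As a sanity check on the file itself, here is a small Python sketch that lists each clusternode with its nodeid and confirms that every fence device a node references is actually defined under fencedevices. The function name summarize_cluster_conf is hypothetical, and the sample is a trimmed copy of the config above (com_info and passwords left out); it only checks internal consistency, not whether cman accepts the file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above, for illustration only.
SAMPLE_CONF = """
<cluster config_version="20" name="sentinel">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <fence><method name="1"><device name="sentinel1d"/></method></fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <fence><method name="1"><device name="sentinel2d"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.252" name="sentinel1d"/>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.253" name="sentinel2d"/>
  </fencedevices>
</cluster>
"""


def summarize_cluster_conf(xml_text):
    """Return (nodes, missing): node tuples and undefined fence device names."""
    root = ET.fromstring(xml_text.strip())
    defined = {dev.get("name") for dev in root.iter("fencedevice")}
    nodes = []
    for node in root.iter("clusternode"):
        used = [dev.get("name") for dev in node.findall("./fence/method/device")]
        nodes.append((node.get("name"), node.get("nodeid"), used))
    referenced = {name for _, _, used in nodes for name in used}
    missing = sorted(referenced - defined)
    return nodes, missing


if __name__ == "__main__":
    nodes, missing = summarize_cluster_conf(SAMPLE_CONF)
    for name, nodeid, used in nodes:
        print(name, "nodeid", nodeid, "fence devices", used)
    print("undefined fence devices:", missing)
```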

What could be preventing the nodes from joining the cluster?

Gordan