[Linux-cluster] GFS + DRBD Problems

Wed Mar 5 17:10:20 UTC 2008

Are you sure the times are equal on both nodes, and even written back to the
hardware clock?
ntpdate <time-server> && hwclock --systohc
I ask because I had exactly the same behaviour last week, where the only
problem was that the times on the nodes differed. The nodes would not get
fenced; immediately after the second node joined the cluster, the first one
"lost connection to node 1" and all cluster services just vanished on that
node, while the filesystem was still mounted (on both nodes), and so on.

After setting times to normal on both nodes everything was working as 
expected.
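
A minimal sketch of the kind of check that would have caught this, assuming
the current epoch time has already been collected from each node (for example
with `date +%s` over ssh). The collection step, the function names and the
1-second threshold are assumptions for illustration only, not anything cman
or dlm prescribes:

```python
from itertools import combinations


def clock_skews(node_times):
    """Return {(node_a, node_b): absolute skew in seconds} for every node pair."""
    return {
        (a, b): abs(node_times[a] - node_times[b])
        for a, b in combinations(sorted(node_times), 2)
    }


def skewed_pairs(node_times, max_skew=1.0):
    """List the node pairs whose clocks differ by more than max_skew seconds."""
    return [pair for pair, skew in clock_skews(node_times).items() if skew > max_skew]


if __name__ == "__main__":
    # Hypothetical readings; the node names come from the quoted cluster.conf.
    readings = {"sentinel1c": 1204737020.0, "sentinel2c": 1204733420.0}
    for a, b in skewed_pairs(readings):
        print(f"clock skew between {a} and {b} exceeds threshold")
```

Running something like this before starting cman on the second node would
flag an hour-sized offset like the one in the hypothetical readings.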

Marc.

On Tuesday 04 March 2008 23:35:21 Gordan Bobic wrote:
> Gordan Bobic wrote:
> > As I thought, the problem I'm seeing is indeed rather multi-part. The
> > first part is now resolved - large time-skips due to the system clock
> > being out of date until ntpd syncs it up. It seems that large time jumps
> > made dlm choke.
> >
> > Now for part 2:
> >
> > The two nodes connect - certainly enough to sync up DRBD. That stage
> > goes through fine. They start cman and other cluster components, but it
> > would appear they never actually find each other.
> >
> > When mounting the shared file system:
> >
> > Node 1:
> > GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
> > GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
> > GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
> > GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
> > GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
> > GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107
> > GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
> > GFS: fsid=sentinel:root.0: jid=0: Done
> > GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
> > GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
> > GFS: fsid=sentinel:root.0: jid=1: Done
> > GFS: fsid=sentinel:root.0: Scanning for log elements...
> > GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
> > GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs
> > GFS: fsid=sentinel:root.0: Done
> >
> >
> > Node 2:
> > GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
> > GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
> > GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
> > GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
> > GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
> > GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0
> > GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
> > GFS: fsid=sentinel:root.0: jid=0: Done
> > GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
> > GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
> > GFS: fsid=sentinel:root.0: jid=1: Done
> > GFS: fsid=sentinel:root.0: Scanning for log elements...
> > GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
> > GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs
> > GFS: fsid=sentinel:root.0: Done
> >
> > Unless I'm reading this wrong, they are both trying to use JID 0.
> >
> > The second node to join generally chokes at some point during the boot,
> > but AFTER it mounted the GFS volume. On the booted node, cman_tool
> > status says:
> >
> > # cman_tool status
> > Version: 6.0.1
> > Config Version: 20
> > Cluster Name: sentinel
> > Cluster Id: 28150
> > Cluster Member: Yes
> > Cluster Generation: 4
> > Membership state: Cluster-Member
> > Nodes: 1
> > Expected votes: 1
> > Total votes: 1
> > Quorum: 1
> > Active subsystems: 6
> > Flags: 2node
> > Ports Bound: 0
> > Node name: sentinel1c
> > Node ID: 1
> > Multicast addresses: 239.192.109.100
> > Node addresses: 10.0.0.1
> >
> > So the second node never joined.
> > I know for a fact that the network connection between them is working,
> > as they sync DRBD.
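
The "second node never joined" conclusion can be checked mechanically. A
minimal sketch, assuming the `Key: value` layout shown in the status output
above; the function names and the trimmed sample are illustrative, and the
number of configured nodes is passed in by hand rather than read from
cluster.conf:

```python
def parse_cman_status(text):
    """Parse 'Key: value' lines as printed by `cman_tool status` into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def missing_nodes(status, configured_nodes):
    """Number of configured nodes that are not currently cluster members."""
    return configured_nodes - int(status["Nodes"])


# Trimmed copy of the output quoted above.
SAMPLE = """\
Cluster Name: sentinel
Cluster Member: Yes
Nodes: 1
Expected votes: 1
Flags: 2node
Node name: sentinel1c
"""

if __name__ == "__main__":
    status = parse_cman_status(SAMPLE)
    print(missing_nodes(status, configured_nodes=2), "node(s) have not joined")
```
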
> >
> > cluster.conf is here:
> >
> > <?xml version="1.0"?>
> > <cluster config_version="20" name="sentinel">
> >         <cman two_node="1" expected_votes="1"/>
> >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >         <clusternodes>
> >                 <clusternode name="sentinel1c" nodeid="1" votes="1">
> >                         <com_info>
> >                                 <rootsource name="drbd"/>
> >                                 <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
> >                                                 fstype          = "ext3"
> >                                                 device          = "/dev/sda2"
> >                                                 chrootdir       = "/var/comoonics/chroot"
> >                                 />-->
> >                                 <syslog name="localhost"/>
> >                                 <rootvolume     name            = "/dev/drbd1"
> >                                                 mountopts       = "defaults,noatime,nodiratime,noquota"
> >                                 />
> >                                 <eth    name    = "eth0"
> >                                         ip      = "10.0.0.1"
> >                                         mac     = "00:0B:DB:92:C5:E1"
> >                                         mask    = "255.255.255.0"
> >                                         gateway = ""
> >                                 />
> >                                 <fenceackserver user    = "root"
> >                                                 passwd  = "password"
> >                                 />
> >                         </com_info>
> >                         <fence>
> >                                 <method name = "1">
> >                                         <device name = "sentinel1d"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >                 <clusternode name="sentinel2c" nodeid="2" votes="1">
> >                         <com_info>
> >                                 <rootsource name="drbd"/>
> >                                 <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
> >                                                 fstype          = "ext3"
> >                                                 device          = "/dev/sda2"
> >                                                 chrootdir       = "/var/comoonics/chroot"
> >                                 />-->
> >                                 <syslog name="localhost"/>
> >                                 <rootvolume     name            = "/dev/drbd1"
> >                                                 mountopts       = "defaults,noatime,nodiratime,noquota"
> >                                 />
> >                                 <eth    name    = "eth0"
> >                                         ip      = "10.0.0.2"
> >                                         mac     = "00:0B:DB:90:4E:1B"
> >                                         mask    = "255.255.255.0"
> >                                         gateway = ""
> >                                 />
> >                                 <fenceackserver user    = "root"
> >                                                 passwd  = "password"
> >                                 />
> >                         </com_info>
> >                         <fence>
> >                                 <method name = "1">
> >                                         <device name = "sentinel2d"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >         </clusternodes>
> >         <cman/>
> >         <fencedevices>
> >                 <fencedevice agent="fence_drac" ipaddr="192.168.254.252" login="root" name="sentinel1d" passwd="password"/>
> >                 <fencedevice agent="fence_drac" ipaddr="192.168.254.253" login="root" name="sentinel2d" passwd="password"/>
> >         </fencedevices>
> >         <rm>
> >                 <failoverdomains/>
> >                 <resources/>
> >         </rm>
> > </cluster>
> >
> > What could be causing the nodes to not join in the cluster?
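
Before looking further, it can help to rule out simple inconsistencies in
cluster.conf itself. The following is only a sketch: it checks three things
(unique nodeid values, fence devices that are actually declared, and the
two_node/expected_votes pairing) and says nothing about whether the
configuration is otherwise correct. The function name and the trimmed sample
are illustrative:

```python
import xml.etree.ElementTree as ET


def check_cluster_conf(xml_text):
    """Return a list of human-readable problems found in a cluster.conf string."""
    root = ET.fromstring(xml_text)
    problems = []

    nodes = root.findall("./clusternodes/clusternode")
    ids = [n.get("nodeid") for n in nodes]
    if len(set(ids)) != len(ids):
        problems.append("duplicate nodeid values")

    # Every <device> used in a fence method must be declared under <fencedevices>.
    declared = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    for node in nodes:
        for dev in node.findall("./fence/method/device"):
            if dev.get("name") not in declared:
                problems.append(
                    f"{node.get('name')}: undeclared fence device {dev.get('name')}"
                )

    # two_node="1" is meant for exactly two nodes with expected_votes="1".
    for cman in root.findall("./cman"):
        if cman.get("two_node") == "1":
            if len(nodes) != 2 or cman.get("expected_votes") != "1":
                problems.append("two_node=1 needs exactly two nodes and expected_votes=1")
    return problems


# Trimmed version of the configuration quoted above (com_info omitted).
SAMPLE = """\
<cluster config_version="20" name="sentinel">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <fence><method name="1"><device name="sentinel1d"/></method></fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <fence><method name="1"><device name="sentinel2d"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac" name="sentinel1d"/>
    <fencedevice agent="fence_drac" name="sentinel2d"/>
  </fencedevices>
</cluster>
"""

if __name__ == "__main__":
    print(check_cluster_conf(SAMPLE) or "no problems found by these checks")
```

The trimmed sample passes all three checks, so a problem of this kind does
not appear to explain the failure to join.
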
>
> A bit of additional information. When both nodes come up at the same
> time, they actually sort out the journals between them correctly. One
> gets 0, the other 1.
>
> But almost immediately afterwards, this happens on the 2nd node:
> dlm: closing connection to node 1
> dlm: connect from non cluster node
>
> shortly followed by DRBD keeling over:
>
> drbd1: Handshake successful: DRBD Network Protocol version 86
> drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
> drbd1: conn( WFConnection -> WFReportParams )
> drbd1: Discard younger/older primary did not found a decision
> Using discard-least-changes instead
> drbd1: State change failed: Device is held open by someone
> drbd1:   state = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown r--- }
> drbd1:  wanted = { cs:WFReportParams st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
> drbd1: helper command: /sbin/drbdadm pri-lost-after-sb
> drbd1: Split-Brain detected, dropping connection!
> drbd1: self 866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469
> drbd1: peer 572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469
> drbd1: conn( WFReportParams -> Disconnecting )
> drbd1: helper command: /sbin/drbdadm split-brain
> drbd1: error receiving ReportState, l: 4!
> drbd1: asender terminated
> drbd1: tl_clear()
> drbd1: Connection closed
> drbd1: conn( Disconnecting -> StandAlone )
> drbd1: receiver terminated
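
The "self" and "peer" lines in the log above are easier to compare field by
field. A small sketch, assuming the four colon-separated values are, in
order, the current UUID, the bitmap UUID and two history UUIDs (that field
order is my assumption about DRBD 8; the field and function names are
illustrative):

```python
FIELDS = ("current", "bitmap", "history1", "history2")  # assumed field order


def parse_uuids(line):
    """Split a colon-separated DRBD UUID line into named fields."""
    return dict(zip(FIELDS, line.strip().split(":")))


def differing_fields(self_line, peer_line):
    """Names of the fields whose values differ between the two nodes."""
    mine, theirs = parse_uuids(self_line), parse_uuids(peer_line)
    return [f for f in FIELDS if mine[f] != theirs[f]]


if __name__ == "__main__":
    # Values copied from the log above.
    self_line = "866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469"
    peer_line = "572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469"
    print(differing_fields(self_line, peer_line))  # ['current', 'history1']
```

Under that assumption both sides share the same bitmap UUID but have
different current UUIDs, which is consistent with the split-brain report in
the log.
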
>
> At this point the 1st node seems to lock up, but despite fencing being
> set up, the 2nd node doesn't get powered down. The fencing device is a
> DRAC III ERA/O. Rebooting the 2nd node makes things revert back to it
> trying to use JID 0, which is already used by the 1st node, and things
> go wrong again.
>
> I'm sure I must be missing something obvious here, but for the life of
> me I cannot see what.
>
> Gordan

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/
