[Linux-cluster] DRBD8 and GFS issues
Tiago Cruz
tiagocruz at forumgdh.net
Wed Jun 11 23:10:43 UTC 2008
Hello guys,
I'm trying to set up a two-node cluster using DRBD 8.x and GFS 1.x
on RHEL 5.2 x86_64.
The problem is: when one machine goes down (node2), node1 stops working
(for example, a simple 'ls -l' on the shared mount point) until the second
machine returns.
I'm using GFS this way:
# gfs_mkfs -t hotsite:gfs-00 -p lock_dlm -j 2 /dev/drbd0
# mount -v /dev/drbd0 /test
I'm causing a failure on the second node this way:
# echo 1 > /proc/sys/kernel/sysrq
# echo b > /proc/sysrq-trigger
==============================================================================
$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="60">
  </fence_daemon>
  <clusternodes>
    <clusternode name="drdb_hotsite-1" nodeid="1">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="drdb_hotsite-2" nodeid="2">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
==============================================================================
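One thing worth checking in the configuration above: both nodes reference a fence device named "gnbd", while the only device declared under <fencedevices> is "manual". Whether that mismatch explains the failure is not established here, but a quick consistency check can be sketched in Python. The function name undefined_fence_devices is hypothetical, and the embedded XML is simply a copy of the cluster.conf shown above.

```python
import xml.etree.ElementTree as ET

# Copy of the cluster.conf from this post, embedded so the sketch is self-contained.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="60">
  </fence_daemon>
  <clusternodes>
    <clusternode name="drdb_hotsite-1" nodeid="1">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="drdb_hotsite-2" nodeid="2">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
"""


def undefined_fence_devices(conf_xml):
    """Map each node name to the fence device names it uses that are
    not declared under <fencedevices>. (Hypothetical helper.)"""
    root = ET.fromstring(conf_xml)
    declared = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
    }
    missing = {}
    for node in root.findall("./clusternodes/clusternode"):
        used = [dev.get("name") for dev in node.findall("./fence/method/device")]
        unknown = [name for name in used if name not in declared]
        if unknown:
            missing[node.get("name")] = unknown
    return missing


if __name__ == "__main__":
    for node, devices in undefined_fence_devices(CLUSTER_CONF).items():
        print(f"{node}: undeclared fence device(s): {', '.join(devices)}")
```

Run against the configuration in this post, the check reports "gnbd" as undeclared for both nodes. This only compares names inside the file; it does not test whether any fence agent actually works.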
The logs follow:
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: PingAck did not arrive in time.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: asender terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Terminating asender thread
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: short read expecting header on sock: r=-512
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Creating new current UUID
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Connection closed
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: outdate-peer helper broken, returned 0
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:NetworkFailure st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver (re)started
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 2.
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: drdb_hotsite-2 not a cluster member after 0 sec post_fail_delay
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 0.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 31 high seq received 31
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 168
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 356 rep 192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 31 high delivered 31 received flag 1
Jun 11 19:59:12 hotsite-bsb-la-1 kernel: dlm: closing connection to node 2
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message from node 1
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
.....
Jun 11 20:01:32 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 11.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 14 high seq received 14
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 16c
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 14 high delivered 14 received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [1] member 192.168.0.4:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 9 high delivered 9 received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0) ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message 192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message from node 1
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Handshake successful: Agreed network protocol version 88
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: old = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Starting asender thread (from drbd0_receiver [526])
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: data-integrity-alg: <not-used>
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Outdated )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: tl_clear()
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: susp( 1 -> 0 )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Secondary -> Primary )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Began resync as SyncSource (will sync 548864 KB [137216 bits set]).
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:05 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Acquiring the transaction lock...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replaying journal...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replayed 0 of 1 blocks
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: replays = 0, skips = 0, sames = 1
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Journal replayed in 5s
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Resync done (total 15 sec; paused 0 sec; 36588 K/sec)
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Trying to join cluster "lock_dlm", "hotsite:gfs-00"
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: dlm: Using TCP for communications
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Joined cluster. Now mounting FS...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Done
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:07:25 hotsite-bsb-la-1 kernel: dlm: connecting to 2
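To make the timeline in the log above easier to read, here is a minimal Python sketch that counts the "fence ... failed" messages per node and measures how long the DRBD "susp" flag stayed at 1. It assumes that susp( 0 -> 1 ) means DRBD suspended I/O on the device, which is an interpretation and not something the log states. The function names are hypothetical, the sample is a handful of lines copied from the log, and the year 2008 is taken from the post date because syslog timestamps omit it. Month parsing with %b assumes an English/C locale.

```python
import re
from datetime import datetime

# A few lines copied verbatim from the log in this post.
SAMPLE_LOG = """\
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: susp( 1 -> 0 )
"""

# timestamp, host, program (optional [pid]), message
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ (\S+?)(?:\[\d+\])?: (.*)$")
SUSP_RE = re.compile(r"susp\( (\d) -> (\d) \)")
FENCE_FAIL_RE = re.compile(r'fence "([^"]+)" failed')


def parse_time(stamp, year=2008):
    """Parse a syslog timestamp; the year is an assumption (post date)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarize(log_text):
    """Return (fence failure count per node, list of (start, end) susp windows)."""
    fence_failures = {}
    windows = []
    suspended_since = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        when = parse_time(match.group(1))
        program, message = match.group(2), match.group(3)
        if program == "fenced":
            failed = FENCE_FAIL_RE.search(message)
            if failed:
                node = failed.group(1)
                fence_failures[node] = fence_failures.get(node, 0) + 1
        elif program == "kernel":
            susp = SUSP_RE.search(message)
            if not susp:
                continue
            if susp.groups() == ("0", "1"):
                suspended_since = when
            elif susp.groups() == ("1", "0") and suspended_since is not None:
                windows.append((suspended_since, when))
                suspended_since = None
    return fence_failures, windows


if __name__ == "__main__":
    failures, windows = summarize(SAMPLE_LOG)
    for node, count in failures.items():
        print(f"{node}: {count} failed fence attempt(s)")
    for start, end in windows:
        print(f"susp=1 from {start} to {end} ({(end - start).total_seconds():.0f} s)")
```

On the sample lines, the susp window runs from 19:59:08 to 20:05:04, which overlaps the whole period of failed fence attempts. Feeding the full /var/log/messages excerpt to summarize() would give the complete failure count.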
Thanks!
--
Tiago Cruz
http://everlinux.com
Linux User #282636