[Linux-cluster] GFS2 2 Node Cluster - lost Node - Mount not writeable

Thomas Börnert tb at tbits.net
Tue Feb 26 22:40:30 UTC 2008


Hi List,

Two servers, connected with a crossover cable.

My RPMs:
gfs2-utils-0.1.38-1.el5
gfs-utils-0.1.12-1.el5
kmod-gfs2-1.52-1.16.el5
cman-2.0.73-1.el5_1.1

My cluster.conf on both nodes:
---------------------------------------------------------------------------------
<?xml version="1.0"?>
<cluster name="cluster" config_version="2">
<cman two_node="1" expected_votes="1">
</cman>
<clusternodes>

<clusternode name="node1" votes="1" nodeid="1">
         <fence>
                <method name="human">
                        <device name="human" nodename="node1"/>
                </method>
        </fence>
</clusternode>

<clusternode name="node2" votes="1" nodeid="2">
         <fence>
                <method name="human">
                        <device name="human" nodename="node2"/>
                </method>
        </fence>
</clusternode>
</clusternodes>

<fencedevices>
        <fencedevice name="human" agent="fence_manual"/>
</fencedevices>
</cluster>
---------------------------------------------------------------------------------------
My /etc/hosts on both nodes:
192.168.0.1	node1
192.168.0.2	node2

How I created and mounted the filesystem:
mkfs.gfs2 -p lock_dlm -t cluster:drbd -j 2 /dev/drbd0
mount -t gfs2 -o noatime,nodiratime /dev/drbd0 /test
(By the way: DRBD works fine as Primary/Primary.)

OK, I can use /test on both nodes and can write to files
and so on.

cman_tool nodes
--------------------------------------------------------------------------------------
Node  Sts   Inc   Joined               Name
   1   M    364   2008-02-26 23:20:16  node1
   2   M    360   2008-02-26 23:20:16  node2

cman_tool status
-------------------------------------------------------------------------------------
Version: 6.0.1
Config Version: 3
Cluster Name: cluster
Cluster Id: 34996
Cluster Member: Yes
Cluster Generation: 364
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 6
Flags: 2node 
Ports Bound: 0  
Node name: node2
Node ID: 2
Multicast addresses: 239.192.136.61 
Node addresses: 192.168.0.2

Now I power node1 off!

My log on node2 shows:
-----------------------------------------------------------------------------------------
==> /var/log/messages <==
Feb 26 23:27:22 node2 last message repeated 13 times

==> /var/log/kernel <==
Feb 26 23:27:31 node2 kernel: tg3: eth1: Link is down.
Feb 26 23:27:32 node2 kernel: tg3: eth1: Link is up at 100 Mbps, full duplex.
Feb 26 23:27:32 node2 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Feb 26 23:27:36 node2 kernel: drbd0: PingAck did not arrive in time.
Feb 26 23:27:36 node2 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 26 23:27:36 node2 kernel: drbd0: Creating new current UUID
Feb 26 23:27:36 node2 kernel: drbd0: asender terminated
Feb 26 23:27:36 node2 kernel: drbd0: short read expecting header on sock: r=-512
Feb 26 23:27:36 node2 kernel: drbd0: tl_clear()
Feb 26 23:27:36 node2 kernel: drbd0: Connection closed
Feb 26 23:27:36 node2 kernel: drbd0: Writing meta data super block now.
Feb 26 23:27:36 node2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Feb 26 23:27:36 node2 kernel: drbd0: receiver terminated
Feb 26 23:27:36 node2 kernel: drbd0: receiver (re)started
Feb 26 23:27:36 node2 kernel: drbd0: conn( Unconnected -> WFConnection )

==> /var/log/messages <==
Feb 26 23:27:37 node2 last message repeated 3 times
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] entering GATHER state from 2.
Feb 26 23:27:42 node2 root: Process did not exit cleanly, returned 2 with signal 0
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering GATHER state from 0.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Creating commit token because I am the rep.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Saving state aru 31 high seq received 31
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Storing new sequence id for ring 170
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering COMMIT state.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering RECOVERY state.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] position [0] member 192.168.0.2:
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] previous ring seq 364 rep 192.168.0.1
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] aru 31 high delivered 31 received flag 1
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Did not need to originate any messages in recovery.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Sending initial ORF token
Feb 26 23:27:44 node2 openais[3288]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 26 23:27:44 node2 openais[3288]: [CLM  ] New Configuration:
Feb 26 23:27:44 node2 fenced[3307]: node1 not a cluster member after 0 sec post_fail_delay
Feb 26 23:27:44 node2 openais[3288]: [CLM  ]       r(0) ip(192.168.0.2)
Feb 26 23:27:44 node2 fenced[3307]: fencing node "node1"

==> /var/log/kernel <==
Feb 26 23:27:44 node2 kernel: dlm: closing connection to node 1

==> /var/log/messages <==
Feb 26 23:27:44 node2 openais[3288]: [CLM  ] Members Left:
Feb 26 23:27:45 node2 openais[3288]: [CLM  ]       r(0) ip(192.168.0.1)
Feb 26 23:27:45 node2 fence_manual: Node node1 needs to be reset before recovery can procede.  Waiting for node1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] Members Joined:
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] New Configuration:
Feb 26 23:27:45 node2 openais[3288]: [CLM  ]       r(0) ip(192.168.0.2)
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] Members Left:
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] Members Joined:
Feb 26 23:27:45 node2 openais[3288]: [SYNC ] This node is within the primary component and will provide service.
Feb 26 23:27:45 node2 openais[3288]: [TOTEM] entering OPERATIONAL state.
Feb 26 23:27:45 node2 openais[3288]: [CLM  ] got nodejoin message 192.168.0.2
Feb 26 23:27:45 node2 openais[3288]: [CPG  ] got joinlist message from node 2
Feb 26 23:27:47 node2 root: Process did not exit cleanly, returned 2 with signal 0
-------------------------------------------------------------------------------------------------------------
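
The fence_manual line in the log above says that recovery is waiting either for node1 to rejoin or for a manual acknowledgement. As a rough sketch only, the pending node name could be pulled out of the log like this. It assumes the message format shown above, and the helper name `pending_manual_fence` is made up for illustration. The `fence_ack_manual -n node1` command is the one the log message itself names, and it should only be run after checking that the node really is powered off.

```shell
# Print the node named in the most recent fence_manual
# "needs to be reset" message read from standard input.
# Assumes the message format seen in the log above.
pending_manual_fence() {
    sed -n 's/.*fence_manual: Node \([^ ]*\) needs to be reset.*/\1/p' | tail -n 1
}

# Possible usage (assumes the messages go to /var/log/messages):
#   node=$(pending_manual_fence < /var/log/messages)
#   [ -n "$node" ] && echo "fencing of $node is waiting for: fence_ack_manual -n $node"
```

This only reports what fenced is waiting for; it does not acknowledge anything by itself.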

ls /test works, but touch /test/testfile hangs.

cman_tool nodes shows:
------------------------------------------------------------------------------------------------------------------
Node  Sts   Inc   Joined               Name
   1   X    364                        node1
   2   M    360   2008-02-26 23:20:16  node2
-----------------------------------------------------------------------------------------------------------------
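
For scripting, the lost node can be picked out of this output by its Sts column. This is a minimal sketch, assuming the column layout shown above (one header line, then status in the second column and the name in the last column); the function name `non_member_nodes` is hypothetical.

```shell
# Read "cman_tool nodes" output on standard input and print the
# names of nodes whose Sts column is anything other than "M".
# Assumes one header line and the name in the last column.
non_member_nodes() {
    awk 'NR > 1 && NF > 0 && $2 != "M" { print $NF }'
}

# Possible usage:
#   cman_tool nodes | non_member_nodes
```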

cman_tool status shows:
-----------------------------------------------------------------------------------------------------------------
Version: 6.0.1
Config Version: 3
Cluster Name: cluster
Cluster Id: 34996
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1  
Active subsystems: 6
Flags: 2node 
Ports Bound: 0  
Node name: node2
Node ID: 2
Multicast addresses: 239.192.136.61 
Node addresses: 192.168.0.2
------------------------------------------------------------------------------------------------------------------

DRBD is not the problem; its state is already primary (standalone).

Why can't I write to a GFS2 partition in the "lost node" state?

Now I power node1 back on.

DRBD is no problem; it recovers.
Then I start cman, and my hanging touch finishes.

Thanks for any ideas and help

-Thomas



