[Linux-cluster] DRBD+GFS - Link is down, Link is up
Giuseppe Fuggiano
giuseppe.fuggiano at gmail.com
Thu Jun 18 19:22:34 UTC 2009
Hi all,
I configured GFS on top of DRBD (active-active) with RHCS, using IPMI as the
fence device. When I try to mount my GFS resource, the interconnect
interface goes down and one node is fenced. This happens every time.
DRBD connects and becomes primary...
Jun 18 19:04:30 alice kernel: drbd0: Handshake successful: Agreed
network protocol version 89
Jun 18 19:04:30 alice kernel: drbd0: Peer authenticated using 20 bytes
of 'sha1' HMAC
Jun 18 19:04:30 alice kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 18 19:04:30 alice kernel: drbd0: Starting asender thread (from
drbd0_receiver [3315])
Jun 18 19:04:30 alice kernel: drbd0: data-integrity-alg: <not-used>
Jun 18 19:04:30 alice kernel: drbd0: drbd_sync_handshake:
Jun 18 19:04:30 alice kernel: drbd0: self
2BA45318C0A122D1:CBAA0E591815072F:3F39591B4EF90EDD:2E40DDEB552666B9
Jun 18 19:04:30 alice kernel: drbd0: peer
CBAA0E591815072E:0000000000000000:3F39591B4EF90EDD:2E40DDEB552666B9
Jun 18 19:04:30 alice kernel: drbd0: uuid_compare()=1 by rule 7
Jun 18 19:04:30 alice kernel: drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jun 18 19:04:30 alice kernel: drbd0: peer( Secondary -> Primary )
Jun 18 19:04:31 alice kernel: drbd0: conn( WFBitMapS -> SyncSource )
pdsk( UpToDate -> Inconsistent )
Jun 18 19:04:31 alice kernel: drbd0: Began resync as SyncSource (will
sync 16384 KB [4096 bits set]).
Jun 18 19:04:33 alice kernel: drbd0: Resync done (total 1 sec; paused
0 sec; 16384 K/sec)
Jun 18 19:04:33 alice kernel: drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )
Then the fence domain is OK:
Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering GATHER state from 11.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Creating commit token
because I am the rep.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Saving state aru 1b high
seq received 1b
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Storing new sequence id for ring 34
Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering COMMIT state.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering RECOVERY state.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116:
Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep
10.17.44.116
Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru 1b high delivered 1b
received flag 1
Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [1] member 10.17.44.117:
Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep
10.17.44.117
Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru a high delivered a
received flag 1
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Did not need to originate
any messages in recovery.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Sending initial ORF token
Jun 18 19:04:35 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE
Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration:
Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116)
Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left:
Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined:
Jun 18 19:04:36 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE
Jun 18 19:04:36 alice openais[3475]: [CLM ] New Configuration:
Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116)
Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117)
Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Left:
Jun 18 19:04:36 alice openais[3475]: [CLM ] Members Joined:
Jun 18 19:04:36 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117)
Jun 18 19:04:36 alice openais[3475]: [SYNC ] This node is within the
primary component and will provide service.
Jun 18 19:04:36 alice openais[3475]: [TOTEM] entering OPERATIONAL state.
Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116
Jun 18 19:04:36 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.117
Jun 18 19:04:36 alice openais[3475]: [CPG ] got joinlist message from node 1
Jun 18 19:04:40 alice kernel: dlm: connecting to 2
Jun 18 19:04:40 alice kernel: dlm: got connection from 2
Why does eth2 go down here?
Jun 18 19:04:53 alice kernel: eth2: Link is Down
Jun 18 19:04:53 alice openais[3475]: [TOTEM] The token was lost in the
OPERATIONAL state.
Jun 18 19:04:53 alice openais[3475]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Jun 18 19:04:53 alice openais[3475]: [TOTEM] Transmit multicast socket
send buffer size (262142 bytes).
Jun 18 19:04:53 alice openais[3475]: [TOTEM] entering GATHER state from 2.
Jun 18 19:04:57 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:04:57 alice kernel: eth2: 10/100 speed: disabling TSO
Then something goes wrong with DRBD:
Jun 18 19:04:58 alice kernel: drbd0: PingAck did not arrive in time.
Jun 18 19:04:58 alice kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jun 18 19:04:58 alice kernel: drbd0: asender terminated
Jun 18 19:04:58 alice kernel: drbd0: Terminating asender thread
Jun 18 19:04:58 alice kernel: drbd0: short read expecting header on sock: r=-512
Jun 18 19:04:58 alice kernel: drbd0: Creating new current UUID
Jun 18 19:04:58 alice kernel: drbd0: Connection closed
Jun 18 19:04:58 alice kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun 18 19:04:58 alice kernel: drbd0: receiver terminated
Jun 18 19:04:58 alice kernel: drbd0: Restarting receiver thread
Jun 18 19:04:58 alice kernel: drbd0: receiver (re)started
Jun 18 19:04:58 alice kernel: drbd0: conn( Unconnected -> WFConnection )
And something goes wrong in the cluster:
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering GATHER state from 0.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Creating commit token
because I am the rep.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Saving state aru 3c high
seq received 3c
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Storing new sequence id for ring 38
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering COMMIT state.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering RECOVERY state.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116:
Jun 18 19:04:58 alice openais[3475]: [TOTEM] previous ring seq 52 rep
10.17.44.116
Jun 18 19:04:58 alice openais[3475]: [TOTEM] aru 3c high delivered 3c
received flag 1
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Did not need to originate
any messages in recovery.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Sending initial ORF token
Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE
Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration:
Jun 18 19:04:58 alice kernel: dlm: closing connection to node 2
Jun 18 19:04:58 alice fenced[3494]: bob not a cluster member after 0
sec post_fail_delay
Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116)
"bob" node is fenced (it just joined!)
Jun 18 19:04:58 alice fenced[3494]: fencing node "bob"
Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left:
Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.117)
Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined:
Jun 18 19:04:58 alice openais[3475]: [CLM ] CLM CONFIGURATION CHANGE
Jun 18 19:04:58 alice openais[3475]: [CLM ] New Configuration:
Jun 18 19:04:58 alice openais[3475]: [CLM ] r(0) ip(10.17.44.116)
Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Left:
Jun 18 19:04:58 alice openais[3475]: [CLM ] Members Joined:
Jun 18 19:04:58 alice openais[3475]: [SYNC ] This node is within the
primary component and will provide service.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering OPERATIONAL state.
Jun 18 19:04:58 alice openais[3475]: [CLM ] got nodejoin message 10.17.44.116
Jun 18 19:04:58 alice openais[3475]: [CPG ] got joinlist message from node 1
Jun 18 19:05:03 alice kernel: eth2: Link is Down
Jun 18 19:05:08 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:05:08 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:12 alice kernel: eth2: Link is Down
Jun 18 19:05:13 alice fenced[3494]: fence "bob" success
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Trying
to acquire journal lock...
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Looking
at journal...
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Done
eth2 keeps going up and down...
Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:21 alice kernel: eth2: Link is Down
Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:05:24 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:29 alice kernel: eth2: Link is Down
Jun 18 19:05:33 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:05:33 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:07:26 alice kernel: eth2: Link is Down
Jun 18 19:07:29 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:07:29 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:07:36 alice kernel: eth2: Link is Down
Jun 18 19:07:38 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Flow Control: None
Jun 18 19:07:38 alice kernel: eth2: 10/100 speed: disabling TSO
Note that if I don't mount GFS, the node is not fenced and the failover
domains become active. So I suspect the problem lies with GFS, and not, for
example, with the NIC.
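For what it's worth, the flapping is very regular; a throwaway script like
this (a few log lines from above pasted inline for illustration; on the real
box I grep /var/log/messages instead) tallies the transitions:

```shell
#!/bin/sh
# Count eth2 link transitions in a syslog excerpt.
# The here-doc holds sample lines from the log above; replace it with
# the contents of /var/log/messages on a live node.
log=$(cat <<'EOF'
Jun 18 19:04:53 alice kernel: eth2: Link is Down
Jun 18 19:04:57 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Jun 18 19:05:03 alice kernel: eth2: Link is Down
Jun 18 19:05:08 alice kernel: eth2: Link is Up 100 Mbps Full Duplex,
Jun 18 19:05:12 alice kernel: eth2: Link is Down
EOF
)
downs=$(printf '%s\n' "$log" | grep -c 'eth2: Link is Down')
ups=$(printf '%s\n' "$log" | grep -c 'eth2: Link is Up')
echo "downs=$downs ups=$ups"
```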
Here is my configuration:
# cat /etc/drbd.conf
global {
    usage-count no;
}

resource r1 {
    protocol C;

    syncer {
        rate 10M;
        verify-alg sha1;
    }

    startup {
        become-primary-on both;
        wfc-timeout 150;
    }

    disk {
        on-io-error detach;
    }

    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "123456";
        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict violently;
        ping-timeout 50;
    }

    on alice {
        device    /dev/drbd0;
        disk      /dev/sda2;
        address   10.17.44.116:7789;
        meta-disk internal;
    }

    on bob {
        device    /dev/drbd0;
        disk      /dev/sda2;
        address   10.17.44.117:7789;
        meta-disk internal;
    }
}
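While reproducing, I keep an eye on the DRBD state via /proc/drbd. A small
sketch that pulls out the cs:/ro:/ds: fields (a typical DRBD 8.x status line
is hard-coded below; on a live node read the file directly):

```shell
#!/bin/sh
# Extract connection state (cs:), roles (ro:) and disk states (ds:)
# from a /proc/drbd status line. The sample line is what a healthy
# dual-primary resource looks like; replace it with `cat /proc/drbd`.
status=' 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---'
cs=$(echo "$status" | sed -n 's/.*cs:\([^ ]*\).*/\1/p')
ro=$(echo "$status" | sed -n 's/.*ro:\([^ ]*\).*/\1/p')
ds=$(echo "$status" | sed -n 's/.*ds:\([^ ]*\).*/\1/p')
echo "cs=$cs ro=$ro ds=$ds"
```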
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="web" config_version="20" name="web">
<fence_daemon post_fail_delay="0" post_join_delay="6"/>
<clusternodes>
<clusternode name="alice" nodeid="1" votes="1">
<fence>
<method name="1">
<device lanplus="" name="alice-ipmi"/>
</method>
</fence>
</clusternode>
<clusternode name="bob" nodeid="2" votes="1">
<fence>
<method name="1">
<device lanplus="" name="bob-ipmi"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_ipmilan" auth="password"
ipaddr="10.17.44.134" login="cnmca" name="alice-ipmi"
passwd="xxxxxx"/>
<fencedevice agent="fence_ipmilan" auth="password"
ipaddr="10.17.44.135" login="cnmca" name="bob-ipmi" passwd="xxxxxx"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="alice-domain"
ordered="1" restricted="1">
<failoverdomainnode name="alice" priority="1"/>
<failoverdomainnode name="bob" priority="2"/>
</failoverdomain>
<failoverdomain name="bob-domain" ordered="1"
restricted="1">
<failoverdomainnode name="bob" priority="1"/>
<failoverdomainnode name="alice" priority="2"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="10.17.44.16" monitor_link="1"/>
<ip address="10.17.44.17" monitor_link="1"/>
</resources>
<service autostart="1" domain="alice-domain"
name="alice-alias" recovery="relocate">
<ip ref="10.17.44.16"/>
</service>
<service autostart="1" domain="bob-domain"
name="bob-alias" recovery="relocate">
<ip ref="10.17.44.17"/>
</service>
</rm>
</cluster>
# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.17.44.116 alice
172.17.44.117 bob
# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          inet addr:10.17.44.116  Bcast:10.17.44.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe51:7038/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
          collisions:11221 txqueuelen:0
          RX bytes:16151284 (15.4 MiB)  TX bytes:102618030 (97.8 MiB)

eth0      Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
          collisions:11221 txqueuelen:100
          RX bytes:16151284 (15.4 MiB)  TX bytes:102618030 (97.8 MiB)
          Memory:f9140000-f9160000

eth1      Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:f91a0000-f91c0000

eth2      Link encap:Ethernet  HWaddr 00:19:99:29:08:8B
          inet addr:172.17.44.116  Bcast:172.17.44.255  Mask:255.255.255.0
          inet6 addr: fe80::219:99ff:fe29:88b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20 errors:0 dropped:0 overruns:0 frame:0
          TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:1200 (1.1 KiB)  TX bytes:7902 (7.7 KiB)
          Memory:f9200000-f9220000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3541 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3541 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:464552 (453.6 KiB)  TX bytes:464552 (453.6 KiB)
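One more thing I noticed while pasting this (I'm not sure it's related):
bond0/eth0 report a lot of collisions relative to transmitted packets. A
quick ratio check, with the two counters copied from the ifconfig output
above:

```shell
#!/bin/sh
# Collision rate on bond0, using the counters shown by ifconfig above:
# 11221 collisions over 83669 transmitted packets.
collisions=11221
tx=83669
rate=$(awk "BEGIN { printf \"%.1f\", $collisions * 100 / $tx }")
echo "collision rate: ${rate}%"
```

Roughly 13%, which seems high for a healthy segment.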
I hope someone here has already run into this issue.
Thanks in advance.
--
Giuseppe