[Linux-cluster] Clustered NFS problem

Abbes Bettahar haydar2906 at hotmail.com
Wed Sep 21 15:03:57 UTC 2005


Hi,

We have two HP ProLiant 380 G3 servers (Red Hat Advanced Server 3) attached
by fibre optic to an HP MSA1000 SAN, and we want to install and configure
the Red Hat Cluster Suite.

I set up and configured a clustered NFS service on the two nodes, RAC1 and RACGFS:

clumanager-1.2.26.1-1
redhat-config-cluster-1.0.7-1

I created two quorum partitions, /dev/sdd2 and /dev/sdd3 (100 MB each).

I created another, much larger partition, /dev/sdd4 (over 600 GB), and
formatted it with an ext3 file system.

I installed the Cluster Suite on the first node (RAC1) and the second node
(RACGFS), and started the rawdevices service on both nodes without errors.
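For reference, the raw device bindings on both nodes are configured in
/etc/sysconfig/rawdevices. A sketch of what that file contains for this
partition layout (the raw1/raw2 names match the sharedstate entries in the
cluster.xml below):

```shell
# /etc/sysconfig/rawdevices (both nodes) -- sketch for this layout
# format: <raw device>  <block device>
/dev/raw/raw1  /dev/sdd2
/dev/raw/raw2  /dev/sdd3
```

After changing this file, "service rawdevices restart" rebinds the devices,
and "raw -qa" lists the current bindings.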

This is the /etc/hosts file on node 1 (RAC1) and node 2 (RACGFS):

# Do not remove the following line, or various programs
# that require network functionality will fail.
#127.0.0.1 rac1 localhost.localdomain localhost
127.0.0.1              localhost.localdomain localhost
#
# Private hostnames
#
192.168.253.3           rac1.project.net     rac1
192.168.253.4           rac2.project.net     rac2
192.168.253.10          racgfs.project.net     racgfs
192.168.253.20          raclu_nfs.project.net   raclu_nfs
#
# Hostnames used for Interconnect
#
1.1.1.1                 rac1i.project.net    rac1i
1.1.1.2                 rac2i.project.net    rac2i
1.1.1.3                 racgfsi.project.net    racgfsi
#
192.168.253.5           infra.project.net       infra
192.168.253.7 ractest.project.net     ractest
#

I generated /etc/cluster.xml on both nodes (RAC1 and RACGFS):

<?xml version="1.0"?>
<cluconfig version="3.0">
  <clumembd broadcast="no" interval="750000" loglevel="5" multicast="yes" 
multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
  <cluquorumd loglevel="5" pinginterval="1" tiebreaker_ip=""/>
  <clurmtabd loglevel="5" pollinterval="4"/>
  <clusvcmgrd loglevel="5" use_netlink="yes"/>
  <clulockd loglevel="5"/>
  <cluster config_viewnumber="24" key="978dcd78e05c5961cf1aaaa03b41209b" 
name="cisn"/>
  <sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" 
rawshadow="/dev/raw/raw2" type="raw"/>
  <members>
    <member id="0" name="192.168.253.3" watchdog="no"/>
    <member id="1" name="192.168.253.10" watchdog="no"/>
  </members>
  <services>
    <service checkinterval="5" failoverdomain="cisncluster" id="0" 
maxfalsestarts="0" maxrestarts="0" name="nfs_cisn" userscript="None">
      <service_ipaddresses>
        <service_ipaddress broadcast="None" id="0" 
ipaddress="192.168.253.20" monitor_link="0" netmask="255.255.255.0"/>
      </service_ipaddresses>
      <device id="0" name="/dev/sdd4">
        <mount forceunmount="yes" mountpoint="/u04"/>
        <nfsexport id="0" name="/u04">
          <client id="0" name="*" options="rw"/>
        </nfsexport>
      </device>
    </service>
  </services>
  <failoverdomains>
    <failoverdomain id="0" name="cisncluster" ordered="yes" restricted="no">
      <failoverdomainnode id="0" name="192.168.253.3"/>
      <failoverdomainnode id="1" name="192.168.253.10"/>
    </failoverdomain>
  </failoverdomains>
</cluconfig>

I created an NFS share on /u04 (mounted from /dev/sdd4) using the Cluster GUI
manager on RAC1, then ran the following command on both nodes (RAC1 and RACGFS):
service clumanager start

I checked the result on both nodes. On RAC1, clustat reports:

Cluster Status - project                                          09:04:34
Cluster Quorum Incarnation #1
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  192.168.253.3      Active     <-- You are here
  192.168.253.10     Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0


On RACGFS, clustat reports:

Cluster Status - cisn                                             09:07:39
Cluster Quorum Incarnation #3
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  192.168.253.3      Active
  192.168.253.10     Active     <-- You are here

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0



When I ran ifconfig on RAC1, I saw that the service IP address
192.168.253.20 was configured on eth2:0.

Then, on the other servers, I ran the following command:
mount -t nfs 192.168.253.20:/u04 /u04

Everything worked: I could list the /u04 contents from any server.
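To double-check the export from a client, I use the standard NFS client
commands (the IP here is the floating service address from cluster.xml):

```shell
# verify the floating service address exports /u04, then mount and list it
showmount -e 192.168.253.20
mount -t nfs 192.168.253.20:/u04 /u04
ls /u04
```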

My only problem is the following:

To test that the clustered NFS would fail over correctly, I rebooted RAC1
repeatedly, and RACGFS continued to work as the failover server. When I ran
ifconfig on RACGFS, I saw that the service IP address 192.168.253.20 was
configured on eth0:0. The other servers could list the /u04 contents (the
clustered NFS mount) a few seconds after RAC1 rebooted.

But after many reboots I hit a serious problem: neither cluster node brings
up the service IP address 192.168.253.20, as ifconfig on both nodes shows.

On RAC1:

eth0      Link encap:Ethernet  HWaddr 00:0B:CD:EF:2B:C1
          inet addr:1.1.1.1  Bcast:1.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:89170 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87405 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17288193 (16.4 Mb)  TX bytes:14452757 (13.7 Mb)
          Interrupt:15

eth2      Link encap:Ethernet  HWaddr 00:0B:CD:FF:44:02
          inet addr:192.168.253.3  Bcast:192.168.253.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1349991 errors:0 dropped:0 overruns:0 frame:0
          TX packets:435450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1592635536 (1518.8 Mb)  TX bytes:162026101 (154.5 Mb)
          Interrupt:7

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1001181 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1001181 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:76097441 (72.5 Mb)  TX bytes:76097441 (72.5 Mb)

On RACGFS:

eth0      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E4
          inet addr:192.168.253.10  Bcast:192.168.253.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:211223 errors:0 dropped:0 overruns:0 frame:0
          TX packets:160026 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14917480 (14.2 Mb)  TX bytes:13886063 (13.2 Mb)
          Interrupt:25

eth1      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E3
          inet addr:1.1.1.3  Bcast:1.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:26

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:184529 errors:0 dropped:0 overruns:0 frame:0
          TX packets:184529 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10971489 (10.4 Mb)  TX bytes:10971489 (10.4 Mb)

I tried many things. I stopped the cluster services on both nodes and
restarted them, but unfortunately that doesn't help, and we still cannot
obtain the clustered NFS mount.


Do you have any idea how to fix this problem?

Thanks for your replies and help.

Abbes Bettahar
514-296-0756




