[Linux-cluster] RHCS TestCluster with ScientificLinux 5.2

Rainer Schwierz R.Schwierz at physik.tu-dresden.de
Wed Aug 19 15:36:15 UTC 2009


Hello experts,

In preparation for a new production system I have set up a test system
with RHCS under Scientific Linux 5.2.
It consists of two identical FSC/RX200 nodes, a Brocade Fibre Channel
switch, an FSC/SX80 Fibre Channel RAID array, and an APC power switch.
The configuration is attached at the end.
I want to have three (GFS) filesystems
- exported via NFS to a number of clients, each service with its own IP
- backed up via TSM to a TSM server

I see some problems for which I need an explanation or solution:
1) If I connect the NFS clients to the IP of the configured NFS service
   started e.g. on tnode02, the filesystem is mounted, but I see a
   strange lock problem:
     tnode02 kernel: portmap: server "client-IP" not responding, timed out
     tnode02 kernel: lockd: server "client-IP" not responding, timed out
    It goes away if I bind the NFS clients directly to the IP of the
    node tnode02. If I start the services on tnode01, it is exactly the
    same problem, solved by binding the clients directly to tnode01. It
    does not depend on the firewall configuration; it is the same if I
    switch off iptables on both tnode0[12] and the clients.

2) tnode02 died with a kernel panic; I found no really helpful logs
    regarding the panic, only a lot of messages about problems with
    NFS locking over GFS:

   kernel: lockd: grant for unknown block
   kernel: dlm: dlm_plock_callback: lock granted after lock request failed

   before the kernel panicked, but is this a real reason to panic?

   At this point tnode01 tried to take over the cluster and to fence
   tnode02, which gave an error I do not understand, because fence_apc
   run by hand (On, Off, Status) works properly:

tnode01 fenced[3127]: fencing node "tnode02.phy.tu-dresden.de"
tnode01 fenced[3127]: agent "fence_apc" reports:
Traceback (most recent call last):
  File "/sbin/fence_apc", line 829, in ?
    main()
  File "/sbin/fence_apc", line 349, in main
    do_power_off(sock)
  File "/sbin/fence_apc", line 813, in do_power_off
    x = do_power_switch(sock, "off")
  File "/sbin/fence_apc", line 611, in do_power_switch
    result_code, response = power_off(txt + ndbuf)
  File "/sbin/fence_apc", line 817, in power_off
    x = power_switch(buffer, False, "2", "3");
  File "/sbin/fence_apc", line 810, in power_switch
    raise "unknown screen encountered in \n" + str(lines) + "\n"
unknown screen encountered in
['', '> 2', '', '',
 '------- Configure Outlet ------------------------------------------------------', '',
 '    # State  Ph  Name                     Pwr On Dly  Pwr Off Dly  Reboot Dur.',
 ' ----------------------------------------------------------------------------',
 '    2  ON     1   Outlet 2                 0 sec       0 sec        5 sec', '',
 '     1- Outlet Name         : Outlet 2',
 '     2- Power On Delay(sec) : 0',
 '     3- Power Off Delay(sec): 0',
 '     4- Reboot Duration(sec): 5',
 '     5- Accept Changes      : ', '',
 '     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']

   So tnode01 kept trying to fence tnode02 and was therefore not able to
   take over the cluster services. Via system-config-cluster it was also
   not possible to stop any service. Stopping processes did not really
   help. The only solution at this point was to power down both nodes
   and restart the cluster.

So my questions:

Is there a solution for the locking problem when one binds the NFS
clients to the configured NFS service IP?
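
A rough sketch of how one could check this from a client (it only wraps
"rpcinfo -p" and diffs which RPC services, e.g. portmapper, mountd, nfs,
nlockmgr, status, are advertised behind the floating service IP and
behind the node itself; the service IP and node name below are taken
from the attached cluster.conf, adjust as needed):

# Rough diagnostic sketch, not part of the cluster configuration.
# It only wraps "rpcinfo -p <host>" and diffs the advertised RPC
# services as seen through the floating service IP and through the
# node currently running the service.
import subprocess

SERVICE_IP = "111.22.33.32"                  # floating IP of service_nfs_home
NODE_NAME  = "tnode02.tst.tu-dresden.de"     # node currently running the service

def rpc_services(host):
    """Return the set of (service, proto) pairs registered on host."""
    p = subprocess.Popen(["rpcinfo", "-p", host],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    services = set()
    for line in out.decode().splitlines()[1:]:   # skip the header line
        fields = line.split()
        if len(fields) >= 5:
            services.add((fields[4], fields[2]))
    return services

via_service = rpc_services(SERVICE_IP)
via_node    = rpc_services(NODE_NAME)
for label, only in (("node", via_node - via_service),
                    ("service IP", via_service - via_node)):
    print("registered only via %s: %s" % (label, sorted(only)))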

Is there an explanation or solution for the NFS (dlm) GFS locking problem?

Is there a significant update to fence_apc that I have missed?
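
To rule out a difference between hand testing and the way fenced calls
the agent (options are passed on stdin as name=value lines instead of
command-line flags), a sketch like the following could reproduce the
fenced-style invocation. The values come from the attached cluster.conf;
whether this fence_apc version expects "option=" or "action=" on stdin
is an assumption on my side:

# Sketch: call fence_apc the way fenced does, i.e. with options written
# to the agent's stdin as name=value lines.  Values are taken from the
# attached cluster.conf; "option=" vs. "action=" may differ between
# fence_apc versions.
import subprocess

stdin_opts = "\n".join([
    "agent=fence_apc",
    "ipaddr=192.168.0.10",
    "login=xxx",
    "passwd=yy-xxxx",
    "port=2",          # outlet of tnode02
    "option=status",   # use "off"/"on" only when really fencing
]) + "\n"

p = subprocess.Popen(["/sbin/fence_apc"],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
out, _ = p.communicate(stdin_opts.encode())
print("exit code: %d" % p.returncode)
print(out.decode())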

Why do I have to configure the GFS resources with the "force unmount" option?
   I was under the impression that one can mount GFS filesystems
   simultaneously on a number of nodes. If I define the GFS resources
   without "force unmount", the filesystem is not mounted at all. But
   the defined TSM service depends on all filesystems being mounted.
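
Since each filesystem is currently defined twice in the attached
cluster.conf (e.g. home_GFS with force_unmount="1" and Home_GFS with
force_unmount="0"), a small sketch like the following (standard library
only, needs Python >= 2.5 for xml.etree) could cross-check which
clusterfs definition, and therefore which force_unmount setting, every
service really references:

# Sketch: list which clusterfs resource (and which force_unmount
# setting) every service in cluster.conf references.
import xml.etree.ElementTree as ET

tree = ET.parse("/etc/cluster/cluster.conf")
rm = tree.getroot().find("rm")

# Map resource name -> (device, force_unmount) for all clusterfs resources.
clusterfs = {}
for res in rm.find("resources").findall("clusterfs"):
    clusterfs[res.get("name")] = (res.get("device"), res.get("force_unmount"))

for service in rm.findall("service"):
    for ref in service.findall("clusterfs"):
        name = ref.get("ref")
        device, force = clusterfs.get(name, ("?", "?"))
        print("%-25s -> %-15s %-18s force_unmount=%s"
              % (service.get("name"), name, device, force))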

Thanks for any help,  Rainer

The configuration is:
Scientific Linux SL release 5.2 (Boron)
kernel 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 12:51:10 EDT 2009 x86_64 
x86_64 x86_64 GNU/Linux
device-mapper-multipath-0.4.7-23.el5_3.2.x86_64
rgmanager-2.0.38-2.el5_2.1.x86_64
system-config-cluster-1.0.52-1.1.noarch
cman-2.0.84-2.el5.x86_64
kmod-gfs-0.1.23-5.el5_2.4.x86_64
gfs2-utils-0.1.44-1.el5.x86_64
gfs-utils-0.1.17-1.el5.x86_64
lvm2-cluster-2.02.32-4.el5.x86_64
modcluster-0.12.0-7.el5.x86_64
ricci-0.12.0-7.el5.x86_64
openais-0.80.3-15.el5.x86_64

cluster.conf
<?xml version="1.0"?>
<cluster alias="tstw_HA2" config_version="115" name="tstw_HA2">
         <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
         <clusternodes>
                 <clusternode name="tnode02.tst.tu-dresden.de" 
nodeid="1" votes="1">
                         <fence>
                                 <method name="1">
                                         <device name="HA_APC" port="2"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="tnode01.tst.tu-dresden.de" 
nodeid="2" votes="1">
                         <fence>
                                 <method name="1">
                                         <device name="HA_APC" port="1"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman expected_votes="1" two_node="1"/>
         <fencedevices>
                 <fencedevice agent="fence_apc" ipaddr="192.168.0.10" 
login="xxx" name="HA_APC" passwd="yy-xxxx"/>
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="HA_new_failover" 
ordered="1" restricted="1">
                                 <failoverdomainnode 
name="tnode01.tst.tu-dresden.de" priority="1"/>
                                 <failoverdomainnode 
name="tnode02.tst.tu-dresden.de" priority="2"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
                         <clusterfs device="/dev/VG1/LV00" 
force_unmount="1" fsid="53422" fstype="gfs" mountpoint="/global_home" 
name="home_GFS" options=""/>
                         <nfsexport name="home_nfsexport"/>
                         <nfsclient name="tstw_home" 
options="rw,root_squash" path="/global_home" 
target="tstw*.tst.tu-dresden.de"/>
                         <ip address="111.22.33.32" monitor_link="1"/>
                         <ip address="192.168.20.30" monitor_link="1"/>
                         <nfsclient name="fast_nfs_home_clients" 
options="rw,root_squash" path="/global_home" target="192.168.20.0/24"/>
                         <nfsexport name="cluster_nfsexport"/>
                         <nfsclient name="tstw_cluster" 
options="no_root_squash,ro" path="/global_cluster" 
target="tstw*.tst.tu-dresden.de"/>
                         <nfsclient name="fast_nfs_cluster_clients" 
options="no_root_squash,ro" path="/global_cluster" 
target="192.168.20.0/24"/>
                         <script file="/etc/rc.d/init.d/tsm" 
name="TSM_backup"/>
                         <clusterfs device="/dev/VG1/LV10" 
force_unmount="1" fsid="192" fstype="gfs" mountpoint="/global_cluster" 
name="cluster_GFS" options=""/>
                         <clusterfs device="/dev/VG1/LV20" 
force_unmount="1" fsid="63016" fstype="gfs" mountpoint="/global_soft" 
name="software_GFS" options=""/>
                         <nfsexport name="soft_nfsexport"/>
                         <nfsclient name="tstw_soft" 
options="rw,root_squash" path="/global_soft" 
target="tstw*.tst.tu-dresden.de"/>
                         <nfsclient name="fast_nfs_soft_clients" 
options="rw,root_squash" path="/global_soft" target="192.168.20.0/24"/>
                         <nfsclient name="tsts_home" 
options="no_root_squash,rw" path="/global_home" 
target="tsts0*.tst.tu-dresden.de"/>
                         <nfsclient name="tsts_cluster" 
options="rw,root_squash" path="/global_cluster" 
target="tsts0*.tst.tu-dresden.de"/>
                         <nfsclient name="tsts_soft" 
options="rw,root_squash" path="/global_soft" 
target="tsts0*.tst.tu-dresden.de"/>
                         <nfsclient name="tstf_home" 
options="rw,root_squash" path="/global_home" 
target="tstf*.tst.tu-dresden.de"/>
                         <nfsclient name="tstf_cluster" 
options="rw,root_squash" path="/global_cluster" 
target="tstf*.tst.tu-dresden.de"/>
                         <nfsclient name="tstf_soft" 
options="rw,root_squash" path="/global_soft" 
target="tstf*.tst.tu-dresden.de"/>
                         <ip address="111.22.33.31" monitor_link="1"/>
                         <ip address="111.22.33.30" monitor_link="1"/>
                         <ip address="192.168.20.31" monitor_link="1"/>
                         <ip address="192.168.20.32" monitor_link="1"/>
                         <clusterfs device="/dev/VG1/LV20" 
force_unmount="0" fsid="11728" fstype="gfs" mountpoint="/global_soft" 
name="Software_GFS" options=""/>
                         <clusterfs device="/dev/VG1/LV10" 
force_unmount="0" fsid="36631" fstype="gfs" mountpoint="/global_cluster" 
name="Cluster_GFS" options=""/>
                         <clusterfs device="/dev/VG1/LV00" 
force_unmount="0" fsid="45816" fstype="gfs" mountpoint="/global_home" 
name="Home_GFS" options=""/>
                 </resources>
                 <service autostart="1" domain="HA_new_failover" 
name="service_nfs_home">
                         <nfsexport ref="home_nfsexport"/>
                         <nfsclient ref="tstw_home"/>
                         <ip ref="111.22.33.32"/>
                         <nfsclient ref="tsts_home"/>
                         <nfsclient ref="tstf_home"/>
                         <clusterfs ref="home_GFS"/>
                 </service>
                 <service autostart="1" domain="HA_new_failover" 
name="service_nfs_home_fast">
                         <nfsexport ref="home_nfsexport"/>
                         <nfsclient ref="fast_nfs_home_clients"/>
                         <ip ref="192.168.20.32"/>
                         <clusterfs ref="Home_GFS"/>
                 </service>
                 <service autostart="1" domain="HA_new_failover" 
name="service_nfs_cluster">
                         <nfsexport ref="cluster_nfsexport"/>
                         <nfsclient ref="tstw_cluster"/>
                         <nfsclient ref="tsts_cluster"/>
                         <nfsclient ref="tstf_cluster"/>
                         <ip ref="111.22.33.30"/>
                         <clusterfs ref="cluster_GFS"/>
                 </service>
                 <service autostart="1" name="service_nfs_cluster_fast">
                         <nfsexport ref="cluster_nfsexport"/>
                         <ip ref="192.168.20.30"/>
                         <nfsclient ref="fast_nfs_cluster_clients"/>
                         <clusterfs ref="Cluster_GFS"/>
                 </service>
                 <service autostart="1" domain="HA_new_failover" 
name="service_TSM">
                         <ip ref="111.22.33.31"/>
                         <script ref="TSM_backup"/>
                         <clusterfs ref="Software_GFS"/>
                         <clusterfs ref="Cluster_GFS"/>
                         <clusterfs ref="Home_GFS"/>
                 </service>
                 <service autostart="1" domain="HA_new_failover" 
name="service_nfs_soft">
                         <nfsexport ref="soft_nfsexport"/>
                         <nfsclient ref="tstw_soft"/>
                         <nfsclient ref="tsts_soft"/>
                         <nfsclient ref="tstf_soft"/>
                         <ip ref="111.22.33.31"/>
                         <clusterfs ref="software_GFS"/>
                 </service>
                 <service autostart="1" domain="HA_new_failover" 
name="service_nfs_soft_fast">
                         <nfsexport ref="soft_nfsexport"/>
                         <nfsclient ref="fast_nfs_soft_clients"/>
                         <ip ref="192.168.20.31"/>
                         <clusterfs ref="Software_GFS"/>
                 </service>
         </rm>
</cluster>

-- 
| R.Schwierz at physik.tu-dresden.de                     |
| Rainer  Schwierz, Inst. f. Kern- und Teilchenphysik |
| TU Dresden,       D-01062 Dresden                   |
| Tel. ++49 351 463 32957    FAX ++49 351 463 37292   |
| http://iktp.tu-dresden.de/~schwierz/                |



