[Linux-cluster] RHCS TestCluster with ScientificLinux 5.2

Rainer Schwierz R.Schwierz at physik.tu-dresden.de
Thu Sep 17 05:30:09 UTC 2009


Hello,

Hmm, in the meantime the fence_apc problem has been fixed by a more 
recent version of fence_apc.

But the NFS lock problem is still open. Does this mean I should 
definitely not use Scientific Linux and switch to Fedora 11 or RHEL 5.4 
instead?

Cheers, Rainer



Rainer Schwierz wrote:
> Hello experts,
> 
> In preparation for a new production system I have set up a test system
> with RHCS under Scientific Linux 5.2.
> It consists of two identical FSC/RX200 nodes, a Brocade Fibre Channel
> switch, an FSC/SX80 Fibre Channel RAID array, and an APC power switch.
> The configuration is attached at the end.
> I want to have three (GFS) filesystems that are
> - exported via NFS to a number of clients, each service with its own IP
> - backed up via TSM to a TSM server
> 
> I see some problems for which I need an explanation/solution:
> 1) If I connect the NFS clients to the IP of the configured NFS service
>    started e.g. on tnode02, the filesystem is mounted, but I see a
>    strange lock problem:
>      tnode02 kernel: portmap: server "client-IP" not responding, timed out
>      tnode02 kernel: lockd: server "client-IP" not responding, timed out
>    It goes away if I bind the NFS clients directly to the IP of the
>    node tnode02. If I start the services on tnode01, it is exactly the
>    same problem, solved by binding the clients directly to tnode01. It
>    does not depend on the firewall configuration; it is the same if I
>    switch off iptables on both tnode0[12] and the clients (see the
>    check sketched below).
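> 
>    A minimal check from one of the clients might look like the sketch
>    below (the commands and the use of the 111.22.33.32 home-service IP
>    from the cluster.conf are my assumptions, not output from the test
>    cluster):
> 
>      # Do portmap and the NFS lock services (nlockmgr/status) answer
>      # on the floating service IP at all?
>      rpcinfo -p 111.22.33.32
>      # Are the GFS paths actually exported via the service IP?
>      showmount -e 111.22.33.32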
> 
> 2) tnode02 died with a kernel panic; no really helpful logs were found
>    regarding the panic, I only see a lot of messages about problems
>    with NFS locking over GFS:
> 
>    kernel: lockd: grant for unknown block
>    kernel: dlm: dlm_plock_callback: lock granted after lock request failed
> 
>    before the kernel panicked, but is this a real reason to panic?
> 
>    At this point tnode01 tried to take over the cluster and to fence
>    tnode02, which gave an error I do not understand, because fence_apc
>    run by hand (On, Off, Status) works properly (see the sketch after
>    the log below):
> 
> tnode01 fenced[3127]: fencing node "tnode02.phy.tu-dresden.de"
> tnode01 fenced[3127]: agent "fence_apc" reports:
>   Traceback (most recent call last):
>     File "/sbin/fence_apc", line 829, in ?
>       main()
>     File "/sbin/fence_apc", line 349, in main
>       do_power_off(sock)
>     File "/sbin/fence_apc", line 813, in do_power_off
>       x = do_power_switch(sock, "off")
>     File "/sbin/fence_apc", line 611, in do_power_switch
>       result_code, response = power_off(txt + ndbuf)
>     File "/sbin/fence_apc", line 817, in power_off
>       x = power_switch(buffer, False, "2", "3");
>     File "/sbin/fence_apc", line 810, in power_switch
>       raise "unknown screen encountered in \n" + str(lines) + "\n"
>   unknown screen encountered in
>   ['', '> 2', '', '',
>    '------- Configure Outlet ------------------------------------------------------',
>    '',
>    '    # State  Ph  Name                     Pwr On Dly  Pwr Off Dly  Reboot Dur.',
>    ' ----------------------------------------------------------------------------',
>    '    2  ON     1   Outlet 2                 0 sec       0 sec        5 sec',
>    '',
>    '     1- Outlet Name         : Outlet 2',
>    '     2- Power On Delay(sec) : 0',
>    '     3- Power Off Delay(sec): 0',
>    '     4- Reboot Duration(sec): 5',
>    '     5- Accept Changes      : ',
>    '',
>    '     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']
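> 
>    (For comparison, the by-hand test that does work looks roughly like
>    the sketch below; the options are assumed from the RHEL5-era
>    fence_apc, and "xxx"/"yy-xxxx" are the masked login/password from
>    the cluster.conf:
> 
>      # check and switch outlet 2 (tnode02) on the APC switch by hand
>      fence_apc -a 192.168.0.10 -l xxx -p yy-xxxx -n 2 -o status
>      fence_apc -a 192.168.0.10 -l xxx -p yy-xxxx -n 2 -o off
>      fence_apc -a 192.168.0.10 -l xxx -p yy-xxxx -n 2 -o on
>    )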
> 
>    So tnode01 never stopped trying to fence tnode02 and was therefore
>    not able to take over the cluster services. Via system-config-cluster
>    it was also not possible to stop any service. Stopping processes did
>    not really help. The only solution at this point was to power down
>    both nodes and restart the cluster.
> 
> So my questions:
> 
> Is there a solution for the locking problem when one binds the NFS
> clients to the configured NFS service IP?
> 
> Is there an explanation/solution for the NFS (dlm) GFS locking problem?
> 
> Is there a significant update to fence_apc that I have missed?
> 
> Why do I have to configure the GFS resources with the "force unmount"
> option?
>   I was under the impression that one can mount GFS filesystems
>   simultaneously on a number of nodes (a by-hand check is sketched
>   below). If I define the GFS resources without "force unmount", the
>   filesystem is not mounted at all. But running the defined TSM service
>   depends on all the filesystems being mounted.
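> 
>   (A quick by-hand check of the simultaneous-mount assumption could be
>   something like the following sketch, using the device/mountpoint
>   names from the cluster.conf below; the commands are my assumption,
>   not something that was actually run here:
> 
>     # run on both tnode01 and tnode02 while the cluster is quorate
>     mount -t gfs /dev/VG1/LV00 /global_home
>     # list the GFS mounts present on each node
>     mount -t gfs
>   )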
> 
> Thanks for any help,  Rainer
> 
> The configuration is
> Scientific Linux SL release 5.2 (Boron)
> kernel 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 12:51:10 EDT 2009 x86_64 
> x86_64 x86_64 GNU/Linux
> device-mapper-multipath-0.4.7-23.el5_3.2.x86_64
> rgmanager-2.0.38-2.el5_2.1.x86_64
> system-config-cluster-1.0.52-1.1.noarch
> cman-2.0.84-2.el5.x86_64
> kmod-gfs-0.1.23-5.el5_2.4.x86_64
> gfs2-utils-0.1.44-1.el5.x86_64
> gfs-utils-0.1.17-1.el5.x86_64
> lvm2-cluster-2.02.32-4.el5.x86_64
> modcluster-0.12.0-7.el5.x86_64
> ricci-0.12.0-7.el5.x86_64
> openais-0.80.3-15.el5.x86_64
> 
> cluster.conf
> <?xml version="1.0"?>
> <cluster alias="tstw_HA2" config_version="115" name="tstw_HA2">
>         <fence_daemon clean_start="0" post_fail_delay="0" 
> post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="tnode02.tst.tu-dresden.de" nodeid="1" 
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="HA_APC" port="2"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="tnode01.tst.tu-dresden.de" nodeid="2" 
> votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="HA_APC" port="1"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_apc" ipaddr="192.168.0.10" 
> login="xxx" name="HA_APC" passwd="yy-xxxx"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="HA_new_failover" 
> ordered="1" restricted="1">
>                                 <failoverdomainnode 
> name="tnode01.tst.tu-dresden.de" priority="1"/>
>                                 <failoverdomainnode 
> name="tnode02.tst.tu-dresden.de" priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <clusterfs device="/dev/VG1/LV00" 
> force_unmount="1" fsid="53422" fstype="gfs" mountpoint="/global_home" 
> name="home_GFS" options=""/>
>                         <nfsexport name="home_nfsexport"/>
>                         <nfsclient name="tstw_home" 
> options="rw,root_squash" path="/global_home" 
> target="tstw*.tst.tu-dresden.de"/>
>                         <ip address="111.22.33.32" monitor_link="1"/>
>                         <ip address="192.168.20.30" monitor_link="1"/>
>                         <nfsclient name="fast_nfs_home_clients" 
> options="rw,root_squash" path="/global_home" target="192.168.20.0/24"/>
>                         <nfsexport name="cluster_nfsexport"/>
>                         <nfsclient name="tstw_cluster" 
> options="no_root_squash,ro" path="/global_cluster" 
> target="tstw*.tst.tu-dresden.de"/>
>                         <nfsclient name="fast_nfs_cluster_clients" 
> options="no_root_squash,ro" path="/global_cluster" 
> target="192.168.20.0/24"/>
>                         <script file="/etc/rc.d/init.d/tsm" 
> name="TSM_backup"/>
>                         <clusterfs device="/dev/VG1/LV10" 
> force_unmount="1" fsid="192" fstype="gfs" mountpoint="/global_cluster" 
> name="cluster_GFS" options=""/>
>                         <clusterfs device="/dev/VG1/LV20" 
> force_unmount="1" fsid="63016" fstype="gfs" mountpoint="/global_soft" 
> name="software_GFS" options=""/>
>                         <nfsexport name="soft_nfsexport"/>
>                         <nfsclient name="tstw_soft" 
> options="rw,root_squash" path="/global_soft" 
> target="tstw*.tst.tu-dresden.de"/>
>                         <nfsclient name="fast_nfs_soft_clients" 
> options="rw,root_squash" path="/global_soft" target="192.168.20.0/24"/>
>                         <nfsclient name="tsts_home" 
> options="no_root_squash,rw" path="/global_home" 
> target="tsts0*.tst.tu-dresden.de"/>
>                         <nfsclient name="tsts_cluster" 
> options="rw,root_squash" path="/global_cluster" 
> target="tsts0*.tst.tu-dresden.de"/>
>                         <nfsclient name="tsts_soft" 
> options="rw,root_squash" path="/global_soft" 
> target="tsts0*.tst.tu-dresden.de"/>
>                         <nfsclient name="tstf_home" 
> options="rw,root_squash" path="/global_home" 
> target="tstf*.tst.tu-dresden.de"/>
>                         <nfsclient name="tstf_cluster" 
> options="rw,root_squash" path="/global_cluster" 
> target="tstf*.tst.tu-dresden.de"/>
>                         <nfsclient name="tstf_soft" 
> options="rw,root_squash" path="/global_soft" 
> target="tstf*.tst.tu-dresden.de"/>
>                         <ip address="111.22.33.31" monitor_link="1"/>
>                         <ip address="111.22.33.30" monitor_link="1"/>
>                         <ip address="192.168.20.31" monitor_link="1"/>
>                         <ip address="192.168.20.32" monitor_link="1"/>
>                         <clusterfs device="/dev/VG1/LV20" 
> force_unmount="0" fsid="11728" fstype="gfs" mountpoint="/global_soft" 
> name="Software_GFS" options=""/>
>                         <clusterfs device="/dev/VG1/LV10" 
> force_unmount="0" fsid="36631" fstype="gfs" mountpoint="/global_cluster" 
> name="Cluster_GFS" options=""/>
>                         <clusterfs device="/dev/VG1/LV00" 
> force_unmount="0" fsid="45816" fstype="gfs" mountpoint="/global_home" 
> name="Home_GFS" options=""/>
>                 </resources>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_nfs_home">
>                         <nfsexport ref="home_nfsexport"/>
>                         <nfsclient ref="tstw_home"/>
>                         <ip ref="111.22.33.32"/>
>                         <nfsclient ref="tsts_home"/>
>                         <nfsclient ref="tstf_home"/>
>                         <clusterfs ref="home_GFS"/>
>                 </service>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_nfs_home_fast">
>                         <nfsexport ref="home_nfsexport"/>
>                         <nfsclient ref="fast_nfs_home_clients"/>
>                         <ip ref="192.168.20.32"/>
>                         <clusterfs ref="Home_GFS"/>
>                 </service>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_nfs_cluster">
>                         <nfsexport ref="cluster_nfsexport"/>
>                         <nfsclient ref="tstw_cluster"/>
>                         <nfsclient ref="tsts_cluster"/>
>                         <nfsclient ref="tstf_cluster"/>
>                         <ip ref="111.22.33.30"/>
>                         <clusterfs ref="cluster_GFS"/>
>                 </service>
>                 <service autostart="1" name="service_nfs_cluster_fast">
>                         <nfsexport ref="cluster_nfsexport"/>
>                         <ip ref="192.168.20.30"/>
>                         <nfsclient ref="fast_nfs_cluster_clients"/>
>                         <clusterfs ref="Cluster_GFS"/>
>                 </service>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_TSM">
>                         <ip ref="111.22.33.31"/>
>                         <script ref="TSM_backup"/>
>                         <clusterfs ref="Software_GFS"/>
>                         <clusterfs ref="Cluster_GFS"/>
>                         <clusterfs ref="Home_GFS"/>
>                 </service>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_nfs_soft">
>                         <nfsexport ref="soft_nfsexport"/>
>                         <nfsclient ref="tstw_soft"/>
>                         <nfsclient ref="tsts_soft"/>
>                         <nfsclient ref="tstf_soft"/>
>                         <ip ref="111.22.33.31"/>
>                         <clusterfs ref="software_GFS"/>
>                 </service>
>                 <service autostart="1" domain="HA_new_failover" 
> name="service_nfs_soft_fast">
>                         <nfsexport ref="soft_nfsexport"/>
>                         <nfsclient ref="fast_nfs_soft_clients"/>
>                         <ip ref="192.168.20.31"/>
>                         <clusterfs ref="Software_GFS"/>
>                 </service>
>         </rm>
> </cluster>
> 


-- 
| R.Schwierz at physik.tu-dresden.de                     |
| Rainer  Schwierz, Inst. f. Kern- und Teilchenphysik |
| TU Dresden,       D-01062 Dresden                   |
| Tel. ++49 351 463 32957    FAX ++49 351 463 37292   |
| http://iktp.tu-dresden.de/~schwierz/                |



