[Linux-cluster] failover hangs on open file

danwest at comcast.net
Wed Nov 30 15:39:44 UTC 2005


There seems to be a bug that affects service groups when a process outside the cluster's control has open files on a file system that is managed by the cluster.  I am running the RHEL4U1 code release.  An example is described below.
 
The setup is a simple 2-node cluster (nodeA and nodeB) with a virtual IP (VIP) resource and an ext3 file system resource managed via CLVMD.  I have removed a script resource for simplicity.  My service is started on nodeA, where it holds the VIP and the ext3 mount (/mnt/cluster).  I can relocate the service to nodeB with no problem using "clusvcadm -r service -m nodeB", and I can relocate it back without a problem.  But if I open a file on the cluster-managed ext3 mount (vi /mnt/cluster/test) and then try to relocate the service, it fails every time.
 
In the RHEL3 codebase, the behavior was to kill all processes associated with the mount on failure and/or relocation.
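
For anyone trying to reproduce this, here is a minimal sketch of how one might list the processes that are holding the mount busy before attempting the relocation. It is not part of rgmanager or the RHEL3 code; the function name pids_with_open_files is hypothetical, and it assumes a Linux /proc file system and enough privileges to read other processes' entries. It only looks at open file descriptors and working directories, so it is a rough approximation of what "fuser -vm /mnt/cluster" or lsof would report.

```python
import os


def pids_with_open_files(mount_point):
    """Return {pid: [paths]} for processes using files under mount_point.

    Hypothetical helper: scans /proc/<pid>/fd and /proc/<pid>/cwd and keeps
    every symlink target that lies at or below the given directory.
    """
    root = os.path.realpath(mount_point)
    prefix = root.rstrip(os.sep) + os.sep
    holders = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        proc_dir = os.path.join("/proc", entry)
        links = [os.path.join(proc_dir, "cwd")]
        try:
            fd_dir = os.path.join(proc_dir, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited or we lack permission to list its fds
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # descriptor closed or not readable
            if target == root or target.startswith(prefix):
                holders.setdefault(int(entry), []).append(target)
    return holders


if __name__ == "__main__":
    # /mnt/cluster is the mount point used in the example above.
    for pid, paths in sorted(pids_with_open_files("/mnt/cluster").items()):
        print(pid, ", ".join(sorted(set(paths))))
```

Run on nodeA while the file is open in vi, something like this should list the vi process; an empty result would suggest the unmount is failing for some other reason.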
 
Here is the output from /var/log/messages during the relocation error:
 
Nov 29 12:22:12 nodeA clurgmgrd[8445]: <notice> Stopping service SERVICE1
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> stop on fs "testfs" returned 2 (invalid argument(s))
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <crit> #12: RG SERVICE1 failed to stop; intervention required
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> Service SERVICE1 is failed
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <warning> #70: Attempting to restart service SERVICE1 locally.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <err> #43: Service SERVICE1 has failed; can not start.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #2: Service SERVICE1 returned failure code.  Last Owner: nodeA
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #4: Administrator intervention required.
 
Output of clustat after the failed relocation with the file open:
 
Member Status: Quorate, Group Member
 
  Member Name                              State      ID
  ------ ----                              -----      --
  NodeB                                    Online     0x0000000000000002
 
  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  SERVICE1             (null)                         failed
 
Any ideas?
 
Thanks,
 Dan



