[Linux-cluster] RHEL 5.5 fail-over policy with shared resources

rhurst at bidmc.harvard.edu rhurst at bidmc.harvard.edu
Thu Jul 8 14:17:31 UTC 2010


An (unverified) issue occurred yesterday when we moved a service from a RHEL 4.7 cluster into a new, but already existing, RHEL 5.5 cluster:

There was an oversight regarding one of its GFS shared resources: its superblock had not been rewritten with the new cluster name, so it could not mount on the new cluster.  We expected that starting the cluster service would fail and that, per its policy, it would remain "disabled" and not fail over to another domain member.  But it attempted to start up on another member anyway.

That's where we ran into some unexpected trouble.

We attempted to enable the package again and it went along happily mounting private resources, ext3 filesystems, that were by then partially mounted on another member of the fail-over domain.  Fortunately for us, the Linux kernel detected some underlying blocks being modified and immediately switched those mounts to read-only.  We discovered the issue, unmounted all the filesystems and ran e2fsck, which happily repaired a few mishaps.

Fixing the GFS superblock solved our problem, but we are curious whether this is a "feature" or a "bug" in the fail-over attempt.  When private resources fail, we don't get this recovery effort on another server; the service just fails, which is the way we want it to behave.
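For what it's worth, the mismatch behind this could be caught before enabling a migrated service.  The following is only a minimal sketch, assuming the GFS lock table name has the usual "clustername:fsname" form and that the root element of cluster.conf carries the cluster name in its name attribute.  The function names and the sample values ("newcluster", "oldcluster", "tobywav") are hypothetical; the lock table string would have to be read from the filesystem's superblock by whatever means you normally use.

```python
import xml.etree.ElementTree as ET


def cluster_name_from_conf(conf_xml: str) -> str:
    """Return the cluster name from the text of a cluster.conf file."""
    root = ET.fromstring(conf_xml)
    if root.tag != "cluster":
        raise ValueError("root element is not <cluster>")
    name = root.get("name")
    if not name:
        raise ValueError("<cluster> element has no name attribute")
    return name


def lock_table_matches(conf_xml: str, lock_table: str) -> bool:
    """True if the cluster part of 'cluster:fsname' matches cluster.conf."""
    table_cluster, sep, fs_name = lock_table.partition(":")
    if not sep or not table_cluster or not fs_name:
        raise ValueError("lock table is not of the form cluster:fsname")
    return table_cluster == cluster_name_from_conf(conf_xml)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    conf = '<cluster name="newcluster" config_version="1"></cluster>'
    for table in ("newcluster:tobywav", "oldcluster:tobywav"):
        print(table, lock_table_matches(conf, table))
```

A check like this only compares names; it says nothing about why rgmanager relocated the service after the failed start.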

This is still only our observation, and we will at some point stage this scenario at our D.R. site for testing, but I thought I would bounce it off this list.

Pertinent log information follows:

Jul  7 18:59:16 columbia clurgmgrd[20213]: <notice> Starting disabled service service:TOBY 
Jul  7 19:00:04 columbia clurgmgrd: [20213]: <err> 'mount -t gfs -o noatime /dev/mapper/VGCCC-lvoltobywav /toby/wav' failed, error=1 
Jul  7 19:00:13 columbia clurgmgrd[20213]: <notice> start on clusterfs "CCC-lvoltobywav" returned 2 (invalid argument(s)) 
Jul  7 19:00:13 columbia clurgmgrd[20213]: <warning> #68: Failed to start service:TOBY; return value: 1 
Jul  7 19:00:13 columbia clurgmgrd[20213]: <notice> Stopping service service:TOBY 

Jul  7 19:00:46 columbia clurgmgrd[20213]: <notice> Service service:TOBY is recovering 
Jul  7 19:00:46 columbia clurgmgrd[20213]: <warning> #71: Relocating failed service service:TOBY 
Jul  7 19:01:44 columbia clurgmgrd[20213]: <alert> #2: Service service:TOBY returned failure code.  Last Owner: zodiac 
Jul  7 19:01:44 columbia clurgmgrd[20213]: <alert> #4: Administrator intervention required. 
Jul  7 19:01:45 columbia clurgmgrd[20213]: <alert> #2: Service service:TOBY returned failure code.  Last Owner: zodiac 
Jul  7 19:01:45 columbia clurgmgrd[20213]: <alert> #4: Administrator intervention required. 

Jul  7 19:04:05 columbia clurgmgrd[20213]: <notice> Stopping service service:TOBY 
Jul  7 19:04:05 columbia clurgmgrd[20213]: <notice> Service service:TOBY is disabled
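To make the sequence above easier to spot in a longer syslog, here is a small sketch that parses clurgmgrd lines and reports whether a failed start of a service was followed by a relocation attempt.  It assumes only the two line layouts seen in this excerpt ("clurgmgrd[pid]:" and "clurgmgrd: [pid]:"); the names Event, parse_line and relocated_after_failed_start are my own, not part of rgmanager.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches both "clurgmgrd[20213]:" and "clurgmgrd: [20213]:" as seen above.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+clurgmgrd(?::\s*)?\[(?P<pid>\d+)\]:\s+"
    r"<(?P<level>\w+)>\s+(?P<msg>.*\S)\s*$"
)


class Event(NamedTuple):
    stamp: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Event]:
    """Parse one syslog line; return None if it is not a clurgmgrd line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return Event(m["stamp"], m["host"], m["level"], m["msg"])


def relocated_after_failed_start(lines: Iterable[str], service: str) -> bool:
    """True if 'Failed to start <service>' is later followed by a relocation."""
    start_failed = False
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue
        if f"Failed to start {service};" in event.message:
            start_failed = True
        elif start_failed and event.message.endswith(
            f"Relocating failed service {service}"
        ):
            return True
    return False


if __name__ == "__main__":
    # Two lines copied from the excerpt above.
    sample = [
        "Jul  7 19:00:13 columbia clurgmgrd[20213]: <warning> #68: "
        "Failed to start service:TOBY; return value: 1 ",
        "Jul  7 19:00:46 columbia clurgmgrd[20213]: <warning> #71: "
        "Relocating failed service service:TOBY ",
    ]
    print(relocated_after_failed_start(sample, "service:TOBY"))
```

Run against the full excerpt, this would flag the #68 / #71 pair, which is the behaviour we are asking about.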



