[Linux-cluster] Cluster fail over fails when umounting fs

karthikeyan knagalin at redhat.com
Sat Aug 19 05:45:08 UTC 2006


Hi,
Please check the following:
	1) Run e2fsck with -c on /dev/sda1 to check the file system for bad blocks.
	2) Check the disks with the hardware/SAN health monitoring utility supplied by the vendor.
	3) Check whether there is any network flood, such as a DoS attack, on your network.
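
For item 1, keep in mind that e2fsck is not safe to run on a mounted file system. A minimal Python sketch of that precaution follows. The helper names is_mounted and build_fsck_command are made up for illustration, and the code only builds the command line; it does not run anything.

```python
def is_mounted(device, mounts_text):
    """Return True if `device` is a mount source in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # The first field of each /proc/mounts line is the mounted device.
        if fields and fields[0] == device:
            return True
    return False


def build_fsck_command(device, mounts_text):
    """Return the e2fsck argv for a bad-block check, or None if mounted.

    Hypothetical helper: it refuses to build a command for a device that
    still shows up as mounted.
    """
    if is_mounted(device, mounts_text):
        return None
    # -c makes e2fsck use badblocks(8) for a read-only bad-block scan.
    return ["e2fsck", "-c", device]
```

On the node itself, mounts_text would be the contents of /proc/mounts, read after the cluster service has been stopped and /db2 unmounted.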

regards
karthikeyan.N

Neil Watson wrote:
> I'm building a cluster that runs a DB2 service.  The cluster has 2 nodes
> in an active/standby configuration.  I am now performing fail over
> tests.
> 
> Shared resources:
> DB2 controlled by /etc/init.d/db2 start stop script.
> Floating IP address.
> /db2 ext3 file system located on a SAN and connected via HBA.
> 
> Nodes are fenced with ILO cards.
> 
> Nodes are running AS4 x86_64 with the Red Hat Cluster Suite.  RPMs are up
> to date.
> 
> Procedure:
> 
> 1. Connect to DB2 remotely and begin a long SQL insert program.
> 2. While the inserts are being performed, disconnect the fibre cable
> from the HBA on the active node.
> 3. Examine the system logs and watch for fail over.
> 
> Observations:
> 
> 1. Cluster does not fail over to standby node.  Service becomes
> unavailable.
> 2. The log files of the active node report a 'generic error' about the
> status of the shared file system.
> 
> Aug 16 15:32:37 caesar kernel: qla2300 0000:06:01.0: LOOP DOWN detected (2).
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15839
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15847
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15855
> Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 1974
> Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 103813199
> Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 12976642
> Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:45 caesar kernel: Aborting journal on device sda1.
> Aug 16 15:32:45 caesar kernel: ext3_abort called.
> Aug 16 15:32:45 caesar kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
> Aug 16 15:32:45 caesar kernel: Remounting filesystem read-only
> Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 8279
> Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 1027
> Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 103546959
> Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 12943362
> Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> status on fs "db2" returned 1 (generic error)
> Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> Stopping service db2
> Aug 16 15:32:48 caesar clurgmgrd: [5159]: <info> Executing /etc/rc.d/init.d/db2 stop
> Aug 16 15:32:48 caesar su(pam_unix)[1227]: session opened for user dwapinst by (uid=0)
> Aug 16 15:32:49 caesar su:
> Aug 16 15:32:49 caesar su: Instance  : dwapinst
> Aug 16 15:32:49 caesar su: DB2 State : Available
> Aug 16 15:32:49 caesar su(pam_unix)[1227]: session closed for user dwapinst
> Aug 16 15:32:49 caesar db2:  succeeded
> Aug 16 15:32:49 caesar su(pam_unix)[1322]: session opened for user dwapinst by (uid=0)
> Aug 16 15:36:51 caesar su(pam_unix)[5473]: session opened for user root by nhwatson(uid=0)
> Aug 16 15:36:55 caesar su(pam_unix)[6000]: session opened for user dwapinst by nhwatson(uid=0)
> Aug 16 15:36:55 caesar su:
> Aug 16 15:36:55 caesar su: Instance  : dwapinst
> Aug 16 15:36:55 caesar su: DB2 State : Operable
> Aug 16 15:36:55 caesar su(pam_unix)[6000]: session closed for user dwapinst
> Aug 16 15:36:55 caesar db2:  failed
> 3. There are no log entries for this event on the standby node.
> 
> Why does the cluster fail during this test?  What does the 'generic error'
> mean?
> 
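
To see what preceded the 'generic error' in the quoted log, it can help to reduce the log to the first occurrence of each kind of event. The following is a rough Python sketch of that, not a proper log analysis tool. The event names and the function first_occurrences are invented here, the patterns are copied from the messages in the log, and it assumes each message sits on a single line (mail-wrapped lines would have to be rejoined first).

```python
import re

# Event labels are made up for this sketch; the patterns come from the
# messages in the quoted log.
PATTERNS = [
    ("hba_loop_down", re.compile(r"LOOP DOWN detected")),
    ("io_error", re.compile(r"end_request: I/O error, dev (\w+)")),
    ("journal_abort", re.compile(r"Aborting journal on device (\w+)")),
    ("remount_ro", re.compile(r"Remounting filesystem read-only")),
    ("fs_status_failed", re.compile(r'status on fs "([^"]+)" returned (\d+)')),
    ("service_stop", re.compile(r"Stopping service (\S+)")),
]


def first_occurrences(log_lines):
    """Return [(event, timestamp)] for the first match of each pattern,
    in the order the events first appear in the log."""
    seen = {}  # dicts keep insertion order, so this preserves log order
    for line in log_lines:
        line = line.lstrip("> ")   # drop mail quoting, if present
        timestamp = line[:15]      # syslog prefix, e.g. "Aug 16 15:32:37"
        for name, pattern in PATTERNS:
            if name not in seen and pattern.search(line):
                seen[name] = timestamp
    return list(seen.items())
```

Run over the log above, this would list the loop down, the first I/O error, the journal abort, the read-only remount and then the failed fs status check, which is the order they appear in the log.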



