[Linux-cluster] weird happenings on my cluster and another panic.

Fri Oct 27 16:06:01 UTC 2006

On Thu, 2006-10-26 at 21:03 -0400, jason at monsterjam.org wrote:

> Oct 25 20:31:14 tf1 rpcidmapd: rpc.idmapd startup succeeded
> Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PERC 4/DC         Rev: 351X
> Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 02
> Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 1 [Phy 1] for non-raid devices
> Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PERC 4/DC         Rev: 351X
> Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 02
> Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PV22XS            Rev: E.17
> Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 03
> Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 2 [virtual] for logical drives
> Oct 25 20:31:14 tf1 kernel:   Vendor: MegaRAID  Model: LD 0 RAID5  139G  Rev: 351X
> Oct 25 20:31:14 tf1 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
> Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict

Those PERC controllers are in "cluster mode", right?

> Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed
> Oct 25 20:31:14 tf1 kernel: sdb: assuming drive cache: write through
> Oct 25 20:31:14 tf1 kernel:  sdb: sdb1
> Oct 25 20:31:14 tf1 kernel: Attached scsi disk sdb at scsi1, channel 2, id 0, lun 0
> Oct 25 20:31:14 tf1 kernel: Adaptec aacraid driver (1.1-5[2412])
> Oct 25 20:31:14 tf1 kernel: device-mapper: 4.5.0-ioctl (2005-10-04) initialised: dm-devel at redhat.com
> Oct 25 20:31:14 tf1 kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
> Oct 25 20:31:14 tf1 kernel: EXT3-fs: write access will be enabled during recovery.
> 
> so sdb is the gfs volume and is already locked by the other server at this point is my guess.

GFS doesn't do SCSI reservations.  Both nodes need concurrent write
access to the disks.  More to the point, see below...
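
If you want to see which devices are reporting reservation conflicts, here is
a minimal Python sketch that scans syslog lines for them. The regex is based
only on the single line quoted above, and reading "(2,0,0)" as channel, id, lun
is an assumption taken from the "Attached scsi disk sdb at scsi1, channel 2,
id 0, lun 0" line; the function name is made up for illustration.

```python
import re

# Matches kernel lines of the form seen in the quoted log, e.g.
#   "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict"
# Assumption: the tuple in parentheses is (channel, id, lun).
CONFLICT_RE = re.compile(
    r"kernel: (?P<host>scsi\d+) "
    r"\((?P<channel>\d+),(?P<id>\d+),(?P<lun>\d+)\) : reservation conflict"
)


def find_reservation_conflicts(lines):
    """Return a (host, channel, id, lun) tuple for each conflict line."""
    hits = []
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            hits.append((
                match.group("host"),
                int(match.group("channel")),
                int(match.group("id")),
                int(match.group("lun")),
            ))
    return hits


# Two lines copied from the log quoted in this thread.
sample = [
    "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict",
    "Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed",
]
conflicts = find_reservation_conflicts(sample)
print(conflicts)
```

Anything it finds is coming from the SCSI layer or the controller, not from
GFS.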

> Oct 25 20:36:13 tf1 kernel: ------------[ cut here ]------------
> ...
> Oct 25 20:36:13 tf1 kernel:  <0>Fatal exception: panic in 5 seconds

^^^ Argh.

> so my question now is that it appears that I have something misconfigured.. tf1
> should come up as secondary while tf2 is running as primary, right? or should
> tf1 come up and take over as primary and tf2 let him?

Irrespective of anything you did (or didn't do), the panic above is a
bug in cman (or maybe the kernel, but not likely).

... The node panicked trying to start up the cluster software, before
GFS (or rgmanager, or dlm) was even in the picture.  You'll note that in
the modules list, 'gfs' and 'dlm' are not even listed.
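
The same check can be done mechanically. This is a rough Python sketch,
assuming the oops contains a "Modules linked in:" line with space-separated
module names (optionally followed by flags in parentheses). The sample line
below is invented for illustration and is not taken from the actual panic
output; the function name is hypothetical as well.

```python
def missing_modules(modules_line, wanted=("gfs", "dlm")):
    """Return the names in `wanted` that do not appear in the modules list."""
    prefix = "Modules linked in:"
    start = modules_line.find(prefix)
    if start < 0:
        raise ValueError("no 'Modules linked in:' marker found")
    names = modules_line[start + len(prefix):].split()
    # Drop trailing flags such as "(U)" so "cman(U)" is compared as "cman".
    loaded = {name.split("(")[0] for name in names}
    return [name for name in wanted if name not in loaded]


# Invented sample line, NOT from the real oops in this thread.
example = "Modules linked in: cman(U) ext3 jbd"
absent = missing_modules(example)
print(absent)
```

Names are compared exactly, so a module like lock_dlm would not be mistaken
for dlm.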

I hope the newer cman-kernel / dlm-kernel packages fix it ;)

-- Lon