[Linux-cluster] GFS + DRBD Problems
Marc Grimme
grimme at atix.de
Mon Mar 3 19:42:48 UTC 2008
On Monday 03 March 2008 12:23:36 gordan at bobich.net wrote:
> Hi,
>
> I appear to be experiencing a strange compound problem with this that
> is proving rather difficult to troubleshoot, so I'm hoping someone here
> can spot something I haven't.
>
> I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single
> node mounts GFS OK and works, but after a while seems to just block for
> disk. Very much as if it started trying to fence the other node and is
> waiting for acknowledgement. There are no fence devices defined (so this
> could be a possibility), but the other node was never powered up in the
> first place, so it is somewhat beyond me why it might suddenly decide to
> try to fence it. This usually happens after a period of idleness. If the
> node is used, this doesn't seem to happen, but leaving it alone for half
> an hour causes it to block for disk I/O.
As I cannot help you much with the DRBD problems themselves, here is some info
to help you debug them at least ;-).
Regarding OSR being stuck (manual fencing):
You should try using the fenceacksv. As far as I know your configuration, it is
already set up there:
<clusternode name="node1" nodeid="1" votes="1">
        <com_info>
                <rootsource name="drbd"/>
                <!--<chrootenv mountpoint="/var/comoonics/chroot"
                               fstype="ext3"
                               device="/dev/sda2"
                               chrootdir="/var/comoonics/chroot"/>-->
                <syslog name="localhost"/>
                <rootvolume name="/dev/drbd1"
                            mountopts="noatime,nodiratime,noquota"/>
                <eth name="eth0"
                     ip="10.0.0.1"
                     mac="xxx"
                     mask="255.0.0.0"
                     gateway=""/>
                <fenceackserver user="root"
                                passwd="password"/>
        </com_info>
Now you can telnet to the hung node on port 12242 and log in; you will see
immediately whether it is in the manual fencing state or not.
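Before bothering with the interactive login, it can help to confirm the
fenceacksv is even listening. A small sketch, assuming bash's /dev/tcp
redirection and coreutils timeout; "node1" stands in for the hung node's
address:

```shell
# check_fenceack HOST [PORT]: probe the fenceacksv TCP port on a node.
# 12242 is the default fenceacksv port; HOST is a placeholder.
check_fenceack() {
    host=$1
    port=${2:-12242}
    # bash opens /dev/tcp/HOST/PORT as a TCP connection; timeout guards
    # against a filtered port that neither accepts nor refuses.
    if timeout 5 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
        echo "fenceacksv reachable on $host:$port"
    else
        echo "no fenceacksv answering on $host:$port"
    fi
}

check_fenceack node1   # node1 = the hung node
```

If it answers, telnet in and log in with the fenceackserver credentials from
the com_info section.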
If you also install comoonics-fenceacksv-plugins-py you will be able to
trigger sysrqs via the fenceacksv.
Hope that helps with debugging.
Marc.
>
> Unfortunately, it doesn't end there. When an attempt is made to dual-mount
> the GFS file system before the secondary is fully up to date (but is
> connected and syncing), the 2nd node to join notices an inconsistency, and
> withdraws from the cluster. In the process, GFS gets corrupted, and the
> only way to get it to mount again on either node is to repair it with
> fsck.
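One way to guard against mounting too early, assuming the standard /proc/drbd
status file: only let the second node mount once DRBD reports both replicas
UpToDate. A sketch for a single-resource setup like /dev/drbd1 here:

```shell
# drbd_in_sync [STATUS_FILE]: succeed only when the DRBD status file
# contains ds:UpToDate/UpToDate, i.e. both sides are consistent.
# Defaults to /proc/drbd, the kernel module's status file.
drbd_in_sync() {
    status_file=${1:-/proc/drbd}
    grep -q 'ds:UpToDate/UpToDate' "$status_file" 2>/dev/null
}

if drbd_in_sync; then
    echo "DRBD in sync - safe to mount GFS on the second node"
else
    echo "DRBD still syncing - mounting GFS now risks a withdraw"
fi
```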
>
> I'm not sure if this is a problem with my cluster setup or not, but I
> cannot see that the nodes would fail to find each other and get DLM
> working. Console logs seem to indicate that everything is in fact OK, and
> the nodes are connected directly via a cross-over cable.
>
> If the nodes are in sync by the time GFS tries to mount, the mount
> succeeds, but everything grinds to a halt shortly afterwards - so much so
> that the only way to get things moving again is to hard-reset one of the
> nodes, preferably the 2nd one to join.
>
> Here is where the second thing that seems wrong happens - the first node
> doesn't just lock-up at this point, as one might expect (when a connected
> node disappears, e.g. due to a hard reset, cluster is supposed to try to
> fence it until it cleanly rejoins - and it can't possibly fence the other
> node since I haven't configured any fencing devices yet). This doesn't seem
> to happen. The first node seems to continue like nothing happened. This is
> possibly connected to the fact that by this point, GFS is corrupted and has
> to be fsck-ed at next boot. This part may be a cluster setup issue, so I'll
> raise that on the cluster list, although it seems to be a DRBD specific
> peculiarity - using a SAN doesn't have this issue with a nearly identical
> cluster.conf (only difference being the block device specification).
>
> The cluster.conf is as follows:
> <?xml version="1.0"?>
> <cluster config_version="18" name="sentinel">
> <cman two_node="1" expected_votes="1"/>
> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> <clusternodes>
> <clusternode name="sentinel1c" nodeid="1" votes="1">
> <com_info>
> <rootsource name="drbd"/>
> <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
> device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
> <syslog name="localhost"/>
> <rootvolume name="/dev/drbd1" mountopts="noatime,nodiratime,noquota"/>
> <eth name = "eth0"
> ip = "10.0.0.1"
> mac = "00:0B:DB:92:C5:E1"
> mask = "255.255.255.0"
> gateway = ""
> />
> <fenceackserver user="root" passwd="secret"/>
> </com_info>
> <fence>
> <method name="1"/>
> </fence>
> </clusternode>
> <clusternode name="sentinel2c" nodeid="2" votes="1">
> <com_info>
> <rootsource name="drbd"/>
> <!--<chrootenv mountpoint="/var/comoonics/chroot" fstype="ext3"
> device="/dev/sda2" chrootdir="/var/comoonics/chroot"/>-->
> <syslog name="localhost"/>
> <rootvolume name="/dev/drbd1" mountopts="noatime,nodiratime,noquota"/>
> <eth name = "eth0"
> ip = "10.0.0.2"
> mac = "00:0B:DB:90:4E:1B"
> mask = "255.255.255.0"
> gateway = ""
> />
> <fenceackserver user="root" passwd="secret"/>
> </com_info>
> <fence>
> <method name="1"/>
> </fence>
> </clusternode>
> </clusternodes>
> <cman/>
> <fencedevices/>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> </cluster>
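For what it's worth, the empty <fencedevices/> in the config above means fenced
has nothing it can actually call when a node drops out. A minimal sketch of
manual fencing (the device name "manual" is my placeholder; fence_manual blocks
until an operator acknowledges with fence_ack_manual, so it is only a stopgap
until real fence hardware is configured):

```xml
<!-- Sketch only: give fenced a manual fence agent to call. -->
<fencedevices>
        <fencedevice agent="fence_manual" name="manual"/>
</fencedevices>
<!-- and per clusternode, e.g. for sentinel1c: -->
<fence>
        <method name="1">
                <device name="manual" nodename="sentinel1c"/>
        </method>
</fence>
```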
>
> Getting at the logs can be a bit difficult with OSR (they get reset on
> reboot, and it's rather difficult to get to them without rebooting once the
> node stops responding), so I don't have those at the moment.
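Regarding the lost logs: the <syslog name="localhost"/> entry in the com_info
section controls where the OSR initrd sends its syslog output, so pointing it
at a separate machine should preserve the messages across a hard reset. A
sketch, assuming another box on the cross-over network runs a syslogd that
accepts remote messages:

```xml
<!-- Sketch: forward boot-time syslog to a surviving host instead of localhost -->
<syslog name="10.0.0.2"/>
```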
>
> Any suggestions would be welcome at this point.
>
> TIA.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
Gruss / Regards,
Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/ http://www.open-sharedroot.org/
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Deutschland/Germany
Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0
Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand:
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
Vorsitzender des Aufsichtsrats:
Dr. Martin Buss