[Linux-cluster] Two-node cluster unpatched B doesn't see patched A

Mon Feb 11 16:47:50 UTC 2008

Sutton, Harry (MSE) wrote:
> The most recent set of patches for RHCS, comprising:
> 
> RHBA-2008:0093    dlm-kernel bug fix update
> RHBA-2008:0092    cman-kernel bug fix update
> RHBA-2008:0060    cman bug fix update
> RHBA-2008:0095    gnbd-kernel bug fix update
> RHBA-2008:0096    GFS-kernel bug fix update
> RHSA-2008:0055    Important: kernel security and bug fix update
> 
> has resulted in a problem in my two-node (production) cluster. Let me
> explain ;-)
> 
> I have a three-node test cluster where I install all patches before
> rolling them into my (two-node) production cluster; I know, I know,
> they're not the same, and that's the only difference I can see in what
> has happened here (a first in two years). In the three-node cluster
> (which, just to complicate things, only had two active nodes at the
> time), I rolled these patches through the two nodes without taking the
> whole cluster down. That is:
> 
> 1. Stop all cluster services on Node A. Disable auto-start using
> chkconfig <cluster-service-name> off. Services stop successfully, Node A
> leaves the cluster, Node B continues running all shared cluster services
> (GFS, Fibre-channel-connected shared storage, HP MSA1000).
> 2. Patch Node A, reboot to new kernel, re-install HP-supplied QLogic
> driver, edit /etc/modprobe.conf for failover settings, rebuild initrd
> file for QLogic drivers, reboot, re-enable auto-start of cluster
> services, reboot once more and the cluster re-forms.
> 3. Repeat Steps 1 and 2 for Node B.
> 4. Cluster is restored to normal operation, both nodes fully patched.
> 
> On my production cluster, which uses a Quorum Disk in place of the third
> node, I completed steps 1 and 2 on Node A, but the cluster did NOT
> re-form. cman sends out its advertisement, and I can see that Node B
> receives it (by looking at the tcpdump traces), but Node B never responds.
> 
> So: before I take down Node B (which is currently the only one running
> my production services), can someone either (a) explain why the cluster
> is not re-forming, or (b) assure me that by restoring both systems to
> the same patch level, the cluster WILL re-form properly? (Which raises the
> question: why did my test cluster survive the patch process when my
> production cluster didn't? Same versions of everything.)
> 
> Thanks in advance, and best regards,

I'm pretty certain that even simply rebooting node B will let the
cluster re-form. I've heard of this problem before but never got to the
bottom of it because it seems to be quite rare. It is almost certainly
some state in node B that is preventing it replying to node A's join
requests - I suspect it's a bug to do with protocol ACK numbers but
can't be sure.

Before you do it, would you be so kind as to send me the tcpdumps of the
(non-)conversation, including the HELLO messages from node B? It might
help in tracking it down.
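
For reference, a capture along these lines should be enough. This is only a
sketch: it assumes cman is on its default UDP port (6809) and that the
cluster interconnect is eth0, and the output file name is made up, so adjust
all three to your setup. It prints the tcpdump command rather than running
it, so you can check it first and then run it as root on each node:

```shell
# Assumptions (not from this thread): interface eth0, cman on UDP port 6809,
# and a hypothetical output file name. Override via the environment if needed.
IFACE="${IFACE:-eth0}"
PORT="${PORT:-6809}"
OUT="${OUT:-cman-join.pcap}"

# Build the capture command line.
#   -n    do not resolve addresses to names
#   -s 0  capture whole packets instead of a truncated snapshot
#   -w    write raw packets to a file that can be mailed and read back with -r
capture_cmd() {
    printf 'tcpdump -n -i %s -s 0 -w %s udp port %s\n' "$IFACE" "$OUT" "$PORT"
}

capture_cmd
```

Leave it running on both nodes while node A tries to join, so the file from
node B shows both the incoming join requests and its own HELLO messages.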

Thanks,

-- 

Chrissie