[Linux-cluster] Two-node cluster unpatched B doesn't see patched A

Thu Feb 7 22:08:22 UTC 2008

The most recent set of patches for RHCS, comprising:

RHBA-2008:0093    dlm-kernel bug fix update
RHBA-2008:0092    cman-kernel bug fix update
RHBA-2008:0060    cman bug fix update
RHBA-2008:0095    gnbd-kernel bug fix update
RHBA-2008:0096    GFS-kernel bug fix update
RHSA-2008:0055    Important: kernel security and bug fix update

has resulted in a problem in my two-node (production) cluster. Let me 
explain ;-)

I have a three-node test cluster where I install all patches before 
rolling them into my two-node production cluster. I know, I know, the 
two are not the same, and that is the only difference I can find to 
explain what happened here (a first in two years). In the three-node 
cluster (which, just to complicate things, had only two active nodes at 
the time), I rolled these patches through the two nodes without taking 
the whole cluster down. That is:

1. Stop all cluster services on Node A. Disable auto-start using 
chkconfig <cluster-service-name> off. Services stop successfully, Node A 
leaves the cluster, Node B continues running all shared cluster services 
(GFS, Fibre Channel-connected shared storage, HP MSA1000).
2. Patch Node A, reboot to new kernel, re-install HP-supplied QLogic 
driver, edit /etc/modprobe.conf for failover settings, rebuild initrd 
file for QLogic drivers, reboot, re-enable auto-start of cluster 
services, reboot once more and the cluster re-forms.
3. Repeat Steps 1 and 2 for Node B.
4. Cluster is restored to normal operation, both nodes fully patched.
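
For anyone following along, the rolling procedure in Steps 1 to 4 can be written down as an ordered plan. This is only a sketch in Python: it prints the order of operations and runs nothing. The function names, node labels, and the service-name placeholder are hypothetical, and the use of chkconfig <name> on to re-enable auto-start is an assumption (the steps above only say auto-start is re-enabled).

```python
from typing import List, Sequence, Tuple


def node_steps(services: Sequence[str]) -> List[str]:
    """Ordered actions for one node, mirroring Steps 1 and 2 above."""
    steps: List[str] = []
    # Step 1: stop the cluster services, then disable auto-start.
    steps += [f"stop {svc}" for svc in services]
    steps += [f"chkconfig {svc} off" for svc in services]
    # Step 2: patch, rebuild the storage driver stack, re-enable services.
    steps += [
        "apply patches",
        "reboot to new kernel",
        "re-install HP-supplied QLogic driver",
        "edit /etc/modprobe.conf for failover settings",
        "rebuild initrd for QLogic drivers",
        "reboot",
    ]
    # Assumption: auto-start is re-enabled with chkconfig <name> on.
    steps += [f"chkconfig {svc} on" for svc in services]
    steps.append("reboot; node rejoins cluster")
    return steps


def rolling_patch_plan(
    nodes: Sequence[str], services: Sequence[str]
) -> List[Tuple[str, str]]:
    """One node at a time, so the other node keeps running shared services."""
    plan: List[Tuple[str, str]] = []
    for node in nodes:
        for step in node_steps(services):
            plan.append((node, step))
    return plan


if __name__ == "__main__":
    # Hypothetical node labels; the service name is left as a placeholder.
    for node, step in rolling_patch_plan(
        ["Node A", "Node B"], ["<cluster-service-name>"]
    ):
        print(f"{node}: {step}")
```

The point of the structure is that every step for Node A finishes, including the final reboot and rejoin, before anything is touched on Node B.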

On my production cluster, which uses a Quorum Disk in place of the third 
node, I completed Steps 1 and 2 on Node A, but the cluster did NOT 
re-form. cman sends out its advertisement, and I can see that Node B 
receives it (by looking at the tcpdump traces), but Node B never responds.
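
As background on the Quorum Disk standing in for the third node, here is a small sketch of the vote arithmetic in Python. It assumes one vote per node and one vote for the quorum disk, and a strict-majority rule for quorum; none of these values come from the configuration described here, and the function names are hypothetical.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(present_votes: int, expected_votes: int) -> bool:
    """True when the votes present reach the majority threshold."""
    return present_votes >= quorum_threshold(expected_votes)


# Assumed vote weights; the real values live in the cluster configuration.
NODE_VOTES = 1
QDISK_VOTES = 1

expected = 2 * NODE_VOTES + QDISK_VOTES  # two nodes plus the quorum disk

print(quorum_threshold(expected))                      # 2
print(is_quorate(NODE_VOTES + QDISK_VOTES, expected))  # True
print(is_quorate(NODE_VOTES, expected))                # False
```

Under those assumptions, a single node that still holds the quorum disk stays quorate, which is consistent with Node B carrying on alone. It does not by itself explain why Node B ignores Node A's advertisement.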

So: before I take down Node B (which is currently the only one running 
my production services), can someone either (a) explain why the cluster 
is not re-forming, or (b) assure me that by restoring both systems to 
the same patch level, the cluster WILL re-form properly? (Which raises 
the question: why did my test cluster survive the patch process when my 
production cluster didn't? Same versions of everything...)

Thanks in advance, and best regards,

    /Harry Sutton, RHCA
     Hewlett-Packard Company