[Linux-cluster] CS5 still has the "Node x is undead" problem

Alain Moulle Alain.Moulle at bull.net
Thu May 22 14:43:01 UTC 2008


Hi Lon

I've applied the patch (see the resulting code below), but it
does not solve the problem.

Is there another patch related to this problem?

Thanks
Regards
Alain Moullé

>> when testing a two-node cluster with a quorum disk, when
>> I power off node 1, node 2 fences node 1 correctly and
>> fails over the service, but in the log of node 2 I see, before and
>> after the fence success messages, many messages like this:
>> Apr 24 11:30:04 s_sys@xn3 qdiskd[13740]: <crit> Node 2 is undead.
>> Apr 24 11:30:04 s_sys@xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
>> Apr 24 11:30:05 s_sys@xn3 qdiskd[13740]: <crit> Node 2 is undead.
>> Apr 24 11:30:05 s_sys@xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
>> Apr 24 11:30:06 s_sys@xn3 qdiskd[13740]: <crit> Node 2 is undead.
>> Apr 24 11:30:06 s_sys@xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
>> Apr 24 11:30:07 s_sys@xn3 qdiskd[13740]: <crit> Node 2 is undead.
>> Apr 24 11:30:07 s_sys@xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
>> Apr 24 11:30:08 s_sys@xn3 qdiskd[13740]: <crit> Node 2 is undead.
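
(For context: the "undead" message fires when a node that qdiskd has
already evicted keeps heartbeating on the same incarnation number that
was recorded at eviction time. What follows is a minimal sketch of that
check, reusing only the field names and helpers visible in the excerpt
below; it is not the verbatim source, which sits in the same scan loop
in cman/qdisk/main.c:)

                /*
                   Sketch: a node we evicted earlier is still writing
                   status blocks with the incarnation we recorded at
                   eviction time (its "evil incarnation"), so flag it
                   as undead and write the eviction notice again.
                 */
                if (ni[x].ni_evil_incarnation &&
                    ni[x].ni_evil_incarnation ==
                    ni[x].ni_status.ps_incarnation) {
                        clulog(LOG_CRIT, "Node %d is undead.\n",
                               ni[x].ni_status.ps_nodeid);
                        clulog(LOG_ALERT,
                               "Writing eviction notice for node %d\n",
                               ni[x].ni_status.ps_nodeid);
                        qd_write_status(ctx, ni[x].ni_status.ps_nodeid,
                                        S_EVICT, NULL, NULL, NULL);
                }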


http://sources.redhat.com/git/?p=cluster.git;a=commit;h=b2686ffe984c517110b949d604c54a71800b67c9


Resulting code after patch application in cman/qdisk/main.c:
===========================================================
                /*
                   Transition from Online -> Evicted
                 */
                if (ni[x].ni_misses > ctx->qc_tko &&
                     state_run(ni[x].ni_status.ps_state)) {

                        /*
                           Mark our internal views as dead if nodes miss too
                           many heartbeats...  This will cause a master
                           transition if no live master exists.
                         */
                        if (ni[x].ni_status.ps_state >= S_RUN &&
                            ni[x].ni_seen) {
                                clulog(LOG_DEBUG, "Node %d DOWN\n",
                                       ni[x].ni_status.ps_nodeid);
                                ni[x].ni_seen = 0;
                        }

                        ni[x].ni_state = S_EVICT;
                        ni[x].ni_status.ps_state = S_EVICT;
                        ni[x].ni_evil_incarnation =
                                ni[x].ni_status.ps_incarnation;

                        /*
                           Write eviction notice if we're the master.
                         */
                        if (ctx->qc_status == S_MASTER) {
                                clulog(LOG_NOTICE,
                                       "Writing eviction notice for node %d\n",
                                       ni[x].ni_status.ps_nodeid);
                                qd_write_status(ctx, ni[x].ni_status.ps_nodeid,
                                                S_EVICT, NULL, NULL, NULL);
                                if (ctx->qc_flags & RF_ALLOW_KILL) {
                                        clulog(LOG_DEBUG, "Telling CMAN to "
                                                "kill the node\n");
                                        cman_kill_node(ctx->qc_ch,
                                                ni[x].ni_status.ps_nodeid);
                                }
                        }

                        /* Clear our master mask for the node after eviction */
                        if (mask)
                                clear_bit(mask, (ni[x].ni_status.ps_nodeid-1),
                                          sizeof(memb_mask_t));
                        continue;
                }
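
(For reference, clear_bit above is the cluster tree's bitmask helper;
here is a minimal sketch of its assumed semantics -- clearing one bit
in a plain byte-array mask, with node IDs starting at 1 as in the call
above. clear_bit_sketch is a hypothetical name used for illustration,
not the real helper:)

        #include <stdint.h>
        #include <stddef.h>

        /*
           Assumed semantics of clear_bit(mask, bitno, size): clear bit
           "bitno" in a byte-array mask of "size" bytes; out-of-range
           bits are silently ignored.
         */
        static inline void
        clear_bit_sketch(uint8_t *mask, unsigned int bitno, size_t size)
        {
                if (bitno / 8 >= size)
                        return;         /* bit outside the mask */
                mask[bitno / 8] &= (uint8_t)~(1u << (bitno % 8));
        }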



