[Linux-cluster] CS5 still problem "Node x is undead" (contd.)

Mon May 26 08:33:11 UTC 2008

Hi

As told before, the patch :
http://sources.redhat.com/git/?p=cluster.git;a=commit;h=b2686ffe984c517110b949d604c54a71800b67c9
does not solve the problem for my configuration ...

Just an idea/question : could this problem be also linked
to the defaut value of token ? Or has it nothing to do with it ?
Because currently, I have this problem with a Quorum disk
configured and no token record in cluster.conf, so token
is at its default value ...

???
Thanks
Regards
Alain Moullé

> Hi Lon
> I've applied the patch (see resulting code below) but the patch
> does not solve the problem.
> Is there another patch linked to this problem ?
> Thanks
> Regards
> Alain Moullé
<
>>>> when testing a two-nodes cluster with quorum disk, when
>>>> I poweroff the node1 , node 2 fences well the node 1 and
>>>> failovers the service, but in log of node 2 I have before and after
>>>> the fence success messages  many messages like this:
>>>> Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
>>>> Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for

node 2

>>>> Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
>>>> Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for

node 2

>>>> Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
>>>> Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for

node 2

>>>> Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
>>>> Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for

node 2

>>>> Apr 24 11:30:08 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.

http://sources.redhat.com/git/?p=cluster.git;a=commit;h=b2686ffe984c517110b949d604c54a71800b67c9

Resulting code after patch application in cman/qdisk/main.c :
===========================================================
                   Transition from Online -> Evicted
                 */
                if (ni[x].ni_misses > ctx->qc_tko &&
                     state_run(ni[x].ni_status.ps_state)) {

                        /*
                           Mark our internal views as dead if nodes miss too
                           many heartbeats...  This will cause a master
                           transition if no live master exists.
                         */
                        if (ni[x].ni_status.ps_state >= S_RUN &&
                            ni[x].ni_seen) {
                                clulog(LOG_DEBUG, "Node %d DOWN\n",
                                       ni[x].ni_status.ps_nodeid);
                                ni[x].ni_seen = 0;
                        }

                        ni[x].ni_state = S_EVICT;
                        ni[x].ni_status.ps_state = S_EVICT;
                        ni[x].ni_evil_incarnation =
                                ni[x].ni_status.ps_incarnation;

                        /*
                           Write eviction notice if we're the master.
                         */
                        if (ctx->qc_status == S_MASTER) {
                                clulog(LOG_NOTICE,
                                       "Writing eviction notice for node %d\n",
                                       ni[x].ni_status.ps_nodeid);
                                qd_write_status(ctx, ni[x].ni_status.ps_nodeid,
                                                S_EVICT, NULL, NULL, NULL);
                                if (ctx->qc_flags & RF_ALLOW_KILL) {
                                        clulog(LOG_DEBUG, "Telling CMAN to "
                                                "kill the node\n");
                                        cman_kill_node(ctx->qc_ch,
                                                ni[x].ni_status.ps_nodeid);
                                }
                        }

                        /* Clear our master mask for the node after eviction */
                        if (mask)
                                clear_bit(mask, (ni[x].ni_status.ps_nodeid-1),
                                          sizeof(memb_mask_t));
                        continue;
                }

------------------------------