[Linux-cluster] GFS upgrade questions

Wed Jan 21 07:15:38 UTC 2009

On Thu, Dec 18, 2008 at 12:58 PM, Diego Liziero <diegoliz at gmail.com> wrote:
> On Thu, Dec 18, 2008 at 12:18 PM, Fabio M. Di Nitto <fdinitto at redhat.com> wrote:
>>
>> If you are running RHEL or CentOS, you have no reason to upgrade.
>>
>> The RHEL packages and the STABLE2 branch (cluster-2.03.xx) receive the same
>> set of bug fixes.
>> [..]
>>
>> Fabio
>
> Thanks Fabio for your clear explanation.
>
> I was thinking about this upgrade just to see if the "two nodes" case
> was better handled.
>
> Here we first had to add a quorum disk, and even then we still had
> trouble when rebooting (with a reboot command from a shell):
> rgmanager waiting forever while stopping, services not migrating to
> the second node, fencing not starting when the other node is powered
> off (clean_start="1" and power fencing)...
>
> Adding a third node _seems_ to have solved most of them, though.

One thing that seems not to be solved by the third node is the slow reboot.

After typing "reboot" in a shell on one node, with all the other nodes
running and the cluster quorate, it sometimes takes from about 8 to
more than 17 minutes before the node actually reboots, and the
following messages appear on the console:

openais[pid]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[pid]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[pid]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[pid]: [TOTEM] The network interface is down.
openais[pid]: [TOTEM] entering GATHER state from 15.
openais[pid]: [TOTEM] entering GATHER state from 2.
openais[pid]: [TOTEM] entering GATHER state from 0.
openais[pid]: [TOTEM] The consensus timeout expired.
openais[pid]: [TOTEM] entering GATHER state from 3.
openais[pid]: [TOTEM] The consensus timeout expired.
openais[pid]: [TOTEM] entering GATHER state from 3.

The last two lines are then repeated many times.
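
To get a rough idea of how long the node stays in this loop, the
repeated messages can be counted from a saved copy of the console or
syslog output. A minimal sketch in Python, assuming the lines look
exactly like the ones above; the function name summarize_totem and the
sample list are only illustrative and are not part of openais:

```python
import re
from collections import Counter

# Matches lines such as "openais[pid]: [TOTEM] The consensus timeout expired."
TOTEM_LINE = re.compile(r"openais\[[^\]]*\]: \[TOTEM\] (?P<msg>.*)$")


def summarize_totem(lines):
    """Count each distinct [TOTEM] message in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TOTEM_LINE.search(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts


# Sample taken from the console messages quoted above (assumed format).
sample = [
    "openais[pid]: [TOTEM] The token was lost in the OPERATIONAL state.",
    "openais[pid]: [TOTEM] The network interface is down.",
    "openais[pid]: [TOTEM] The consensus timeout expired.",
    "openais[pid]: [TOTEM] entering GATHER state from 3.",
    "openais[pid]: [TOTEM] The consensus timeout expired.",
    "openais[pid]: [TOTEM] entering GATHER state from 3.",
]

# Print the messages from most to least frequent.
for message, count in summarize_totem(sample).most_common():
    print(count, message)
```

Feeding it the real log (for example the lines of /var/log/messages, if
that is where syslog writes on the node) would show how many consensus
timeouts expire before the reboot finally proceeds.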

I saw a mail on this list stating that this can be solved by setting
/proc/sys/net/ipv4/ip_forward to 1, but here it makes no difference.
Another mail suggested that it could be an iptables issue, but the
Linux firewall is disabled here.
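
For reference, this is roughly how the ip_forward setting can be
verified on each node. A minimal Python sketch; the helper names
parse_flag and ip_forward_enabled are hypothetical, and only the /proc
path comes from the text above. It reads the value and never changes it:

```python
from pathlib import Path

# Path named in the text above.
IP_FORWARD = Path("/proc/sys/net/ipv4/ip_forward")


def parse_flag(text):
    """Interpret the contents of a 0/1 sysctl file; True means enabled."""
    return text.strip() == "1"


def ip_forward_enabled(path=IP_FORWARD):
    """Read the sysctl file and report whether IPv4 forwarding is on."""
    return parse_flag(Path(path).read_text())


if __name__ == "__main__":
    if IP_FORWARD.exists():
        print("ip_forward enabled:", ip_forward_enabled())
    else:
        print("no", IP_FORWARD, "on this system")
```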

Is running "reboot" from a shell the correct way to reboot a cluster node?

Is there any way to get a faster clean reboot?

Regards,
Diego.