[Linux-cluster] Freeze with cluster-2.03.11
David Teigland
teigland at redhat.com
Tue Mar 31 16:10:07 UTC 2009
On Tue, Mar 31, 2009 at 11:18:51AM +0200, Kadlecsik Jozsef wrote:
> On Mon, 30 Mar 2009, David Teigland wrote:
>
> > On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote:
> > >
> > > Combing through the log files I found the following:
> > >
> > > Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay
> > > Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs"
> > > Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1??e1??
> > > Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success
> > >
> > > The line saying "can't get node number for node e1??e1??" might be
> > > innocent, but looks suspicious. Why could fenced not get the victim name?
> >
> > I've not seen that before, and I can't explain either how cman_get_node()
> > could have failed or why it printed a garbage string. It's a non-essential
> > bit of code, so that error should not be related to your problem.
>
> Yes, it is surely not related to the freeze, but disturbing.
>
> Hm, in the function dispatch_fence_agent there's an ordering issue, I
> believe. The variable victim_nodename is freed but update_cman is called
> with variable victim pointing to the just freed victim_nodename.
Ah, you're exactly right, thanks for finding that. This bug was fixed in the
STABLE3 branch, and I've just pushed a fix for the next 2.03 release. This
bug will cause secondary fence methods to fail, so it's more serious than the
garbage string.
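
The ordering issue described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the actual fenced source:
dispatch_sketch() and report_victim() are hypothetical stand-ins for
dispatch_fence_agent() and update_cman(), and the real functions take different
arguments. The point is only that the alias "victim" must be used before the
buffer it points to is freed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for update_cman(): remembers the name it was given. */
static char last_reported[64];

static void report_victim(const char *victim)
{
    assert(victim != NULL);
    snprintf(last_reported, sizeof(last_reported), "%s", victim);
}

/* Hypothetical stand-in for dispatch_fence_agent(), in the corrected order. */
static int dispatch_sketch(const char *name)
{
    size_t len = strlen(name) + 1;
    char *victim_nodename = malloc(len);   /* heap copy of the node name */
    const char *victim;

    if (victim_nodename == NULL)
        return -1;
    memcpy(victim_nodename, name, len);
    victim = victim_nodename;              /* alias into the heap buffer */

    /*
     * The problematic order would be:
     *     free(victim_nodename);
     *     report_victim(victim);          // reads freed memory
     * which can yield a garbage name or a failed lookup.
     */
    report_victim(victim);                 /* use the alias while it is valid */
    free(victim_nodename);                 /* release only after the last use */
    return 0;
}
```

Calling dispatch_sketch("web1-gfs") leaves "web1-gfs" in last_reported. With
the two calls swapped, the behaviour is undefined, which would be consistent
with a garbage string like the one in the log.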
(Strangely, the victim_nodename code doesn't exist at all in the RHEL5 branch,
which is why we didn't catch this. I'm not sure how RHEL5/STABLE2 got out of
sync there, that's not supposed to happen.)
Dave