[Linux-cluster] CS5 : clurgmgrd[28359]: segfault

Wed Jan 9 21:54:09 UTC 2008

On Wed, 2008-01-09 at 15:04 +0100, Alain Moulle wrote:
> Hi
> 
> While testing CS5 on a two-node cluster with a quorum disk, I ran ifdown
> on the heartbeat interface and got a segfault in the log:

> Jan  9 09:45:30 s_sys at am1 openais[28300]: [TOTEM] entering OPERATIONAL state.
> Jan  9 09:45:30 s_sys at am1 openais[28300]: [CLM  ] got nodejoin message 172.16.101.91
> Jan  9 09:45:30 s_sys at am1 openais[28300]: [EVT  ] recovery error node: r(0)
> ip(127.0.0.1)  not found
> Jan  9 09:45:30 s_kernel at am1 kernel: clurgmgrd[28359]: segfault at
> 0000000000000000 rip 0000000000408c4a rsp 00007fff04a2c450 error 4
> Jan  9 09:45:30 s_sys at am1 gfs_controld[28328]: cluster is down, exiting
> Jan  9 09:45:30 s_kernel at am1 kernel: dlm: closing connection to node 2
> Jan  9 09:45:30 s_kernel at am1 kernel: dlm: closing connection to node 0
> Jan  9 09:45:30 s_kernel at am1 kernel: dlm: closing connection to node 1
> Jan  9 09:45:30 s_sys at am1 dlm_controld[28322]: cluster is down, exiting
> Jan  9 09:45:30 s_sys at am1 fenced[28316]: cman_get_nodes error -1 104
> Jan  9 09:45:30 s_sys at am1 fenced[28316]: cluster is down, exiting
> Jan  9 09:45:30 s_sys at am1 clurgmgrd[28358]: <crit> Watchdog: Daemon died,
> rebooting...
> Jan  9 09:45:30 s_sys at am1 shutdown[18377]: shutting down for system halt
> 
> Is this already a known problem?

openais died, which took the dlm down and caused rgmanager to crash. The
"nanny" clurgmgrd process (pid 28358 in the log above) then rebooted the
node.

Although the segfault is probably less than ideal, the nanny process
killing the node is probably fine since the node needs to be fenced at
this point anyway.
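
For anyone unfamiliar with the nanny arrangement, here is a rough sketch of
the general pattern: a parent forks the real daemon, waits on it, and reacts
if the child dies from a signal. This is not rgmanager's actual code. The
names watch_child, child_main and on_crash are made up for illustration, and
the "reboot" is just a callback so the sketch is safe to run.

```python
import os
import signal


def watch_child(child_main, on_crash):
    """Fork a child, wait for it, and report how it ended.

    child_main and on_crash are hypothetical callables used only for
    this illustration; a real nanny would reboot the node in on_crash.
    """
    pid = os.fork()
    if pid == 0:
        # Child: run the "daemon" body and never return into the parent's code.
        try:
            child_main()
            os._exit(0)
        except BaseException:
            os._exit(1)

    # Parent (the nanny): block until the child terminates.
    _, status = os.waitpid(pid, 0)

    if os.WIFSIGNALED(status):
        # Child was killed by a signal (e.g. SIGSEGV): treat as a crash.
        on_crash(os.WTERMSIG(status))
        return "crashed"
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
        return "clean"
    return "failed"
```

Calling watch_child with a child_main that sends itself SIGSEGV via
os.kill(os.getpid(), signal.SIGSEGV) exercises the crash path; the parent
only ever sees the wait status, which is why it can still act after the
child has segfaulted.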

What should have happened with rgmanager is:
* it should have seen a negative quorum transition,
* halted cluster services uncleanly, and
* waited to be fenced.
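
The three steps above can be sketched roughly as follows. This is a toy
model only, not how rgmanager is implemented: the ResourceManager class, the
on_quorum_change method and the "halted-unclean" state string are all
assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy model of the expected reaction to quorum loss (hypothetical names)."""
    quorate: bool = True
    awaiting_fence: bool = False
    services: dict = field(default_factory=dict)  # service name -> state string

    def on_quorum_change(self, quorate_now: bool) -> None:
        # Step 1: detect a negative transition (quorate -> inquorate).
        lost_quorum = self.quorate and not quorate_now
        self.quorate = quorate_now
        if not lost_quorum:
            return

        # Step 2: halt services uncleanly, i.e. just mark them down without
        # attempting an orderly stop that may depend on the (now gone) cluster.
        for name in self.services:
            self.services[name] = "halted-unclean"

        # Step 3: do nothing further and wait for another node to fence us.
        self.awaiting_fence = True
```

Usage would look like creating ResourceManager(services={"nfs": "started"})
and calling on_quorum_change(False) once quorum is lost; after that every
service is marked "halted-unclean" and awaiting_fence is True. Regaining or
keeping quorum leaves the services untouched.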

-- Lon