[Linux-cluster] how to recover from process_recovery_barrier status=-104
David Teigland
teigland at redhat.com
Fri Jul 22 04:37:04 UTC 2005
On Thu, Jul 21, 2005 at 11:51:21PM -0400, Dan B. Phung wrote:
> My cluster went down pretty hard, in that I had to hard reboot several
> machines, and now the fence daemon won't come up. I run:
>
> $ ccsd && cman_tool join -w
> $ fence_tool join -w -j 15 -D
> blade02:~ # fence_tool join -w -D -j 15
> fence_tool: wait for quorum 1
> fence_tool: get our node name
> fence_tool: connect to ccs
> fence_tool: start fenced
> fenced: 1122003465 our name from cman "blade02"
This is inconsistent with the data below, which shows blade01 and blade13
as cluster members but not blade02. Maybe you collected the other data
before blade02 joined the cluster...
> blade13:~ # cman_tool nodes
> Node  Votes  Exp  Sts  Name
>    1      1    1    M  blade01
>    2      1    1    X  blade02
>    3      1    1    X  blade03
>    4      1    1    X  blade04
>    6      1    1    X  blade06
>    7      1    1    X  blade07
>    8      1    1    X  blade08
>    9      1    1    X  blade09
>   10      1    1    X  blade10
>   11      1    1    X  blade11
>   12      1    1    X  blade12
>   13      1    1    M  blade13
>   14      1    1    X  blade14
>
> blade13:~ # cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: blade_cluster
> Cluster ID: 38068
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 2
> Expected_votes: 1
> Total_votes: 2
> Quorum: 2
> Active subsystems: 6
> Node name: blade13
>
> blade13:~ # cman_tool services
> Service Name GID LID State Code
> Fence Domain: "default" 1 2 recover 2 -
> [13]
This looks like blade13 is trying to fence some node. blade13 won't let
anyone else join the fence domain until it's completed the fencing; this
is probably why fenced on blade02 isn't getting anywhere.
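
To narrow down which node blade13 might be trying to fence, one rough
starting point is to list the nodes that "cman_tool nodes" marks with
status X (not a member). This is only a sketch: it assumes the five-column
layout shown in the quoted output, the helper name dead_nodes is made up
for illustration, and the result is a list of candidates, not a statement
of what fenced is actually doing.

```shell
# Hypothetical helper: print the names of nodes whose status column is "X".
# Assumes the column order "Node Votes Exp Sts Name" from the quoted output.
# Reads the cman_tool output on stdin and skips the header line.
dead_nodes() {
    awk 'NR > 1 && $4 == "X" { print $5 }'
}

# Usage on a cluster member such as blade13:
#   cman_tool nodes | dead_nodes
```
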
/var/log/messages on blade13 should show whether there's an incomplete
fencing operation and, if so, where it is stuck.
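
A quick way to pull the relevant lines out of the log is to filter it for
anything mentioning "fence". This is a minimal sketch, not a definitive
recipe: the helper name fence_log and its defaults are assumptions, and the
exact wording of the fenced messages depends on your version, so the
pattern is kept deliberately broad.

```shell
# Hypothetical helper: show the most recent log lines that mention "fence".
#   $1  log file to search (default: /var/log/messages)
#   $2  number of trailing matches to show (default: 20)
fence_log() {
    grep -i 'fence' "${1:-/var/log/messages}" | tail -n "${2:-20}"
}

# Usage on blade13 (reading /var/log/messages usually requires root):
#   fence_log
#   fence_log /var/log/messages 50
```
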
Dave