[Linux-cluster] how to recover from process_recovery_barrier status=-104
Dan B. Phung
phung at cs.columbia.edu
Fri Jul 22 03:51:21 UTC 2005
My cluster went down pretty hard, in that I had to hard reboot several
machines, and now the fence daemon won't come up. I run:
$ ccsd && cman_tool join -w
$ fence_tool join -w -j 15 -D
blade02:~ # fence_tool join -w -D -j 15
fence_tool: wait for quorum 1
fence_tool: get our node name
fence_tool: connect to ccs
fence_tool: start fenced
fenced: 1122003465 our name from cman "blade02"
fenced: 1122003465 delay post_join 15s post_fail 0s
fenced: 1122003465 added 14 nodes from ccs
and it hangs there forever until I hit ^C.
On one of the surviving machines, I see (dmesg):
SM: 00000001 process_recovery_barrier status=-104
CMAN: node blade03 has been removed from the cluster : Missed too many heartbeats
SM: 00000001 process_recovery_barrier status=-104
CMAN: node blade06 has been removed from the cluster : Missed too many heartbeats
SM: 00000001 process_recovery_barrier status=-104
CMAN: node blade09 has been removed from the cluster : No response to messages
CMAN: bad generation number 371 in HELLO message from 1, expected 370
CMAN: removing node blade08 from the cluster : No response to messages
CMAN: removing node blade07 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
SM: 00000001 process_recovery_barrier status=-104
Is there a way to recover (restart gfs) without having to reboot this
last machine?
thanks,
dan
p.s. here's some more info:
blade13:~ # cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   M   blade01
   2    1    1   X   blade02
   3    1    1   X   blade03
   4    1    1   X   blade04
   6    1    1   X   blade06
   7    1    1   X   blade07
   8    1    1   X   blade08
   9    1    1   X   blade09
  10    1    1   X   blade10
  11    1    1   X   blade11
  12    1    1   X   blade12
  13    1    1   M   blade13
  14    1    1   X   blade14
blade13:~ # cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: blade_cluster
Cluster ID: 38068
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 2
Active subsystems: 6
Node name: blade13
blade13:~ # cman_tool services
Service          Name                 GID LID State     Code
Fence Domain:    "default"              1   2 recover 2 -
[13]
DLM Lock Space:  "clvmd"                2   3 recover 0 -
[13]
DLM Lock Space:  "lil_cheesy1_lv"      11   4 run       -
[13]
GFS Mount Group: "lil_cheesy1_lv"      12   5 run       -
[13]
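In case it helps anyone reading the node list above, here is a minimal sketch that summarizes the `cman_tool nodes` output. It assumes the Sts column uses `M` for a current cluster member and that any other value (here `X`) means the node is not a member. The `parse_nodes` helper is hypothetical and is not part of cman; the sample text is copied from the listing above.

```python
# Hypothetical helper (not part of cman): summarize `cman_tool nodes` output.
# Assumption: Sts "M" means current member; any other value means not a member.

NODES_OUTPUT = """\
Node Votes Exp Sts Name
1 1 1 M blade01
2 1 1 X blade02
3 1 1 X blade03
4 1 1 X blade04
6 1 1 X blade06
7 1 1 X blade07
8 1 1 X blade08
9 1 1 X blade09
10 1 1 X blade10
11 1 1 X blade11
12 1 1 X blade12
13 1 1 M blade13
14 1 1 X blade14
"""


def parse_nodes(text):
    """Return {node_name: status} for each data row of the listing."""
    status = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five fields and begin with a numeric node ID;
        # this skips the header row and any blank lines.
        if len(fields) == 5 and fields[0].isdigit():
            status[fields[4]] = fields[3]
    return status


nodes = parse_nodes(NODES_OUTPUT)
members = sorted(name for name, sts in nodes.items() if sts == "M")
others = sorted(name for name, sts in nodes.items() if sts != "M")

print("members:", ", ".join(members))
print("not members:", ", ".join(others))
```

On a live node the same function could be fed the captured output of `cman_tool nodes` instead of the pasted sample.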