[Linux-cluster] how to recover from process_recovery_barrier status=-104
Dan B. Phung
phung at cs.columbia.edu
Fri Jul 22 04:56:12 UTC 2005
On 22, Jul, 2005, David Teigland declared:
> On Thu, Jul 21, 2005 at 11:51:21PM -0400, Dan B. Phung wrote:
> > My cluster went down pretty hard, in that I had to hard reboot several
> > machines, and now the fence daemon won't come up. I run:
> >
> > $ ccsd && cman_tool join -w
> > $ fence_tool join -w -j 15 -D
> > blade02:~ # fence_tool join -w -D -j 15
> > fence_tool: wait for quorum 1
> > fence_tool: get our node name
> > fence_tool: connect to ccs
> > fence_tool: start fenced
> > fenced: 1122003465 our name from cman "blade02"
>
> This is inconsistent with the data below which shows that blade1 is a
> cluster member, not blade2. Maybe you collected the other data before
> blade2 joined the cluster...
Right, actually I exited from the fence operation and forced blade02 to
leave the cluster.
> This looks like blade13 is trying to fence some node. blade13 won't let
> anyone else join the fence domain until it's completed the fencing; this
> is probably why fenced on blade02 isn't getting anywhere.
> /var/log/messages on blade13 should show where or if there's an incomplete
> fencing operation.
Here are some excerpts from /var/log/messages:
Jul 21 16:48:05 blade13 kernel: qla2300 0000:02:02.0: LOOP DOWN detected.
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569288
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569296
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569304
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569312
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569320
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569328
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569336
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: fatal: I/O error
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: block = 8696119
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: function = gfs_logbh_wait
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: file = /usr/local/src/cluster-2.6.8.1/gfs-kernel/src/gfs/dio.c, line = 923
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: time = 1121978916
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: about to withdraw from the cluster
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: waiting for outstanding I/O
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: telling LM to withdraw
Jul 21 16:48:37 blade13 kernel: lock_dlm: withdraw abandoned memory
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: withdrawn
Jul 21 16:49:33 blade13 kernel: qla2300 0000:02:02.0: LOOP UP detected (2 Gbps).
Jul 21 17:01:34 blade13 shutdown[7987]: shutting down for system reboot
--
snipped reboot messages
--
Jul 21 17:04:17 blade13 kernel: CMAN: Waiting to join or form a Linux-cluster
Jul 21 17:04:20 blade13 kernel: CMAN: sending membership request
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade12
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade04
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade09
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade03
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade02
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade06
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade07
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade08
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade11
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade01
Jul 21 17:04:24 blade13 clvmd: Cluster LVM daemon started - connected to CMAN
Jul 21 17:04:24 blade13 kernel: CMAN: WARNING no listener for port 11 on node blade01
Jul 21 17:18:16 blade13 kernel: GFS: Trying to join cluster "lock_dlm", "blade_cluster:lil_cheesy1_lv"
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: Joined cluster. Now mounting FS...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Trying to acquire journal lock...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Looking at journal...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Done
(last message repeated 13 times)
Jul 21 23:14:57 blade13 kernel: CMAN: node blade04 rejoining
Jul 21 23:16:52 blade13 kernel: CMAN: node blade12 rejoining
Jul 21 23:21:16 blade13 kernel: CMAN: node blade12 has been removed from the cluster : Shutdown
Jul 21 23:23:02 blade13 kernel: CMAN: node blade02 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:23:03 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:23:27 blade13 kernel: CMAN: node blade03 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:23:28 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:24:12 blade13 kernel: CMAN: node blade06 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:24:13 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:24:33 blade13 kernel: CMAN: node blade09 has been removed from the cluster : No response to messages
Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade08 from the cluster : No response to messages
Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade07 from the cluster : No response to messages
Jul 21 23:24:53 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
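
For anyone digging through a similar log, here is a rough Python sketch for pulling the CMAN node removals and the SM barrier statuses out of /var/log/messages. It is only an illustration: the function name summarize and the two regular expressions are my own, written against the message wording in the excerpt above, and it assumes each log record sits on a single line (not wrapped by a mail client).

```python
import re

# Matches the two removal wordings seen in the excerpt:
#   "CMAN: node X has been removed from the cluster : REASON"
#   "CMAN: removing node X from the cluster : REASON"
REMOVED = re.compile(
    r"CMAN: (?:node (?P<a>\S+) has been removed|removing node (?P<b>\S+)) "
    r"from the cluster : (?P<reason>.+)$"
)
BARRIER = re.compile(
    r"SM: \S+ process_recovery_barrier status=(?P<status>-?\d+)"
)


def summarize(lines):
    """Return ({node: removal reason}, [barrier status, ...])."""
    removed = {}
    statuses = []
    for line in lines:
        line = line.strip()
        m = REMOVED.search(line)
        if m:
            removed[m.group("a") or m.group("b")] = m.group("reason")
            continue
        m = BARRIER.search(line)
        if m:
            statuses.append(int(m.group("status")))
    return removed, statuses


# Three records copied from the excerpt, unwrapped to one per line.
SAMPLE = [
    "Jul 21 23:23:02 blade13 kernel: CMAN: node blade02 has been removed "
    "from the cluster : Missed too many heartbeats",
    "Jul 21 23:23:03 blade13 kernel: SM: 00000001 process_recovery_barrier "
    "status=-104",
    "Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade08 from the "
    "cluster : No response to messages",
]

removed, statuses = summarize(SAMPLE)
for node, reason in sorted(removed.items()):
    print(node, "->", reason)
print("barrier statuses:", statuses)
```

On a real system one would pass an open file instead of SAMPLE, e.g. summarize(open("/var/log/messages")).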
> > blade13:~ # cman_tool nodes
> > Node Votes Exp Sts Name
> > 1 1 1 M blade01
> > 2 1 1 X blade02
> > 3 1 1 X blade03
> > 4 1 1 X blade04
> > 6 1 1 X blade06
> > 7 1 1 X blade07
> > 8 1 1 X blade08
> > 9 1 1 X blade09
> > 10 1 1 X blade10
> > 11 1 1 X blade11
> > 12 1 1 X blade12
> > 13 1 1 M blade13
> > 14 1 1 X blade14
> >
> > blade13:~ # cman_tool status
> > Protocol version: 5.0.1
> > Config version: 1
> > Cluster name: blade_cluster
> > Cluster ID: 38068
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 2
> > Expected_votes: 1
> > Total_votes: 2
> > Quorum: 2
> > Active subsystems: 6
> > Node name: blade13
> >
> > blade13:~ # cman_tool services
> > Service Name GID LID State Code
> > Fence Domain: "default" 1 2 recover 2 -
> > [13]
>
>
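
As a companion to the `cman_tool nodes` output quoted above, the following is a minimal Python sketch that splits the node list into members and non-members. It assumes the five-column layout shown in the quote (Node, Votes, Exp, Sts, Name), with "M" marking a member; the helper name split_by_status is hypothetical, and the input must be the raw command output without the "> >" quote prefixes.

```python
def split_by_status(output):
    """Return (members, others) as lists of node names.

    Expects raw `cman_tool nodes` text: a header line followed by rows of
    the form "<node id> <votes> <exp> <sts> <name>".
    """
    members, others = [], []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a five-column node row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        status, name = fields[3], fields[4]
        (members if status == "M" else others).append(name)
    return members, others


# A few rows copied from the quoted output.
SAMPLE_NODES = """\
Node Votes Exp Sts Name
1 1 1 M blade01
2 1 1 X blade02
12 1 1 X blade12
13 1 1 M blade13
"""

members, others = split_by_status(SAMPLE_NODES)
print("members:", members)
print("not members:", others)
```

For the full quoted table this would report blade01 and blade13 as the only members, which matches the "Nodes: 2" line in the `cman_tool status` output.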