[Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in "LEAVE_START_WAIT" status

Cedric Kimaru rhel_cluster at ckimaru.com
Sun Jun 3 01:25:23 UTC 2012


Fellow Cluster Compatriots,
I'm looking for some guidance here. Whenever my RHEL 5.7 cluster gets into
"LEAVE_START_WAIT" on a given iSCSI volume, the following occurs:

   1. I can't read from or write to the volume.
   2. I can't unmount it from any node.
   3. In-flight/pending I/Os are impossible to identify or kill, since lsof
   on the mount point fails; basically all I/O operations stall or fail
   (see the sketch right after this list).
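
I'm assuming the way to at least see which processes are blocked is
something along these lines; this is only a rough sketch on my part (it
assumes sysrq is enabled on the node), so correct me if there's a better
approach:

# list processes in uninterruptible sleep (D state) and where they're blocked
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# with sysrq enabled, dump every blocked task's stack into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 60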

So my questions are:

   1. What does this line of output from group_tool -v really indicate:
   "00030005 LEAVE_START_WAIT 12 c000b0002 1"? The group_tool man page
   doesn't list these fields.
   2. Does anyone have a list of what these fields represent?
   3. Corrective actions: how do I get out of this state without rebooting
   the entire cluster? (A sketch of what I'm planning to try follows this
   list.)
   4. Is it possible to determine the offending node?
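
For questions 3 and 4, this is roughly what I'm planning to run next, going
off the cman_tool and group_tool man pages; it's only a sketch on my part,
so correct me if any of these don't apply to this state:

# compare the group state as each daemon sees it (run on every node)
cman_tool services

# dump gfs_controld's internal debug buffer for more detail on the gfs groups
group_tool dump gfs

# look for groupd/gfs_controld messages that mention the stuck mount
grep -iE 'groupd|gfs_controld|cluster3_disk2' /var/log/messages | tail -n 50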

thanks,
-Cedric


//misc output

root@bl13-node13:~# clustat
Cluster Status for cluster3 @ Sat Jun  2 20:47:08 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 bl01-node01                                      1 Online, rgmanager
 bl04-node04                                      4 Online, rgmanager
 bl05-node05                                      5 Online, rgmanager
 bl06-node06                                      6 Online, rgmanager
 bl07-node07                                      7 Online, rgmanager
 bl08-node08                                      8 Online, rgmanager
 bl09-node09                                      9 Online, rgmanager
 bl10-node10                                     10 Online, rgmanager
 bl11-node11                                     11 Online, rgmanager
 bl12-node12                                     12 Online, rgmanager
 bl13-node13                                     13 Online, Local, rgmanager
 bl14-node14                                     14 Online, rgmanager
 bl15-node15                                     15 Online, rgmanager


 Service Name                  Owner (Last)                   State
 ------- ----                  ----- ------                   -----
 service:httpd                 bl05-node05                    started
 service:nfs_disk2             bl08-node08                    started


root@bl13-node13:~# group_tool -v
type             level name            id       state node id local_done
fence            0     default         0001000d none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     clvmd           0001000c none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk1  00020005 none
[4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk2  00040005 none
[4 5 6 7 8 9 10 11 13 14 15]
dlm              1     cluster3_disk7  00060005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk8  00080005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk9  000a0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     disk10          000c0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     rgmanager       0001000a none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
dlm              1     cluster3_disk3  00020001 none
[1 5 6 7 8 9 10 11 12 13]
dlm              1     cluster3_disk6  00020008 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk1  00010005 none
[4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk2  00030005 LEAVE_START_WAIT 12 c000b0002 1
[4 5 6 7 8 9 10 11 13 14 15]
gfs              2     cluster3_disk7  00050005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk8  00070005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk9  00090005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     disk10          000b0005 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]
gfs              2     cluster3_disk3  00010001 none
[1 5 6 7 8 9 10 11 12 13]
gfs              2     cluster3_disk6  00010008 none
[1 4 5 6 7 8 9 10 11 12 13 14 15]

root@bl13-node13:~# gfs2_tool list
253:15 cluster3:cluster3_disk6
253:16 cluster3:cluster3_disk3
253:18 cluster3:disk10
253:17 cluster3:cluster3_disk9
253:19 cluster3:cluster3_disk8
253:21 cluster3:cluster3_disk7
253:22 cluster3:cluster3_disk2
253:23 cluster3:cluster3_disk1

root@bl13-node13:~# lvs
    Logging initialised at Sat Jun  2 20:50:03 2012
    Set umask from 0022 to 0077
    Finding all logical volumes
  LV                            VG                            Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lv_cluster3_Disk7             vg_Cluster3_Disk7             -wi-ao   3.00T
  lv_cluster3_Disk9             vg_Cluster3_Disk9             -wi-ao 200.01G
  lv_Cluster3_libvert           vg_Cluster3_libvert           -wi-a- 100.00G
  lv_cluster3_disk1             vg_cluster3_disk1             -wi-ao 100.00G
  lv_cluster3_disk10            vg_cluster3_disk10            -wi-ao  15.00T
  lv_cluster3_disk2             vg_cluster3_disk2             -wi-ao 220.00G
  lv_cluster3_disk3             vg_cluster3_disk3             -wi-ao 330.00G
  lv_cluster3_disk4_1T-kvm-thin vg_cluster3_disk4_1T-kvm-thin -wi-a-   1.00T
  lv_cluster3_disk5             vg_cluster3_disk5             -wi-a- 555.00G
  lv_cluster3_disk6             vg_cluster3_disk6             -wi-ao   2.00T
  lv_cluster3_disk8             vg_cluster3_disk8             -wi-ao   2.00T