[Linux-cluster] GFS hangs, nodes die

Marc Grimme grimme at atix.de
Sun Aug 19 20:12:43 UTC 2007


Hi Sebastian,
you might also want to have a look here:
http://www.open-sharedroot.org/Members/marc/blog/blog-on-gfs/
I collected some information there about the problem you've hit (it must be 
that problem).
Next time you should also look at the console of every node; you should see 
some interesting messages there before the hang.
Use the glock_purge gfs_tool option, it will help, and always keep an eye on 
the gfs_tool counters, especially the lock counts.
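
For example, assuming your GFS filesystem is mounted at /mnt/gfs (replace 
that with your real mount point); note that settune values do not survive a 
remount, so you have to reapply them after every mount:

  # trim up to 50% of the unused glocks per scan (0 disables purging)
  gfs_tool settune /mnt/gfs glock_purge 50

  # verify the tunable took effect
  gfs_tool gettune /mnt/gfs | grep glock_purge

  # keep an eye on the lock counters; a locks count that only ever
  # grows is the symptom described in the blog entry above
  watch -n 10 "gfs_tool counters /mnt/gfs"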

BTW: the "unable to obtain cluster lock" message is only rgmanager 
complaining that it cannot obtain a lock, and that is just a side effect. 
The real problem is that a new lockid cannot be acquired within the timeout.
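
If you want to see how close you get to that timeout, the DLM wait 
statistics together with the number of processes stuck in uninterruptible 
sleep give a rough picture; the awk filter below is just one way of picking 
out the "D" processes:

  # average time (ms, since HZ=1000) lock requests spend queued,
  # broken down per wait type
  cat /proc/cluster/dlm_stats

  # list processes in "D" state; these usually pile up while GFS
  # is waiting for locks
  ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'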

Regards Marc.
On Sunday 19 August 2007 11:53:39 you wrote:
> Hi Marc!
>
> Thanks for your help. As I have restarted everything now, I can't check
> this. I will do so when it crashes again (I will run some tests now). I
> realised that one node hung with a kernel panic. Attached is the
> screenshot.
>
> regards
> sebastian
>
> Marc Grimme wrote:
> > Hello Sebastian,
> > what do gfs_tool counters on the fs tell you?
> > And ps axf? Do you have a lot of "D" processes?
> > Regards Marc.
> >
> > On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
> >> Dear list,
> >>
> >> this is the tragic story of my cluster running rhel/csgfs 4u5: the
> >> cluster generally runs fine, but when I increase the load to a
> >> certain level (heavy I/O), it collapses. About 20% of the nodes crash
> >> (they stop responding, but show no sign of kernel panic), and the
> >> others can't access the GFS resource.
> >> GFS is set up as an rgmanager service with a failover domain for each
> >> node (the same problem also exists when mounting via /etc/fstab).
> >>
> >> Who is willing to provide a happy ending?
> >>
> >> Thanks, Sebastian
> >>
> >> This is what /var/log/messages gives me (on nearly all nodes):
> >> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
> >> status for RG gfs-2
> >> and e.g.
> >> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
> >> cluster lock: Connection timed out
> >>
> >> [root@compute-0-3 ~]# cat /proc/cluster/status
> >> Protocol version: 5.0.1
> >> Config version: 53
> >> Cluster name: dtm
> >> Cluster ID: 741
> >> Cluster Member: Yes
> >> Membership state: Cluster-Member
> >> Nodes: 10
> >> Expected_votes: 11
> >> Total_votes: 10
> >> Quorum: 6
> >> Active subsystems: 8
> >> Node name: compute-0-3
> >> Node ID: 4
> >> Node addresses: 10.1.255.252
> >>
> >> [root@compute-0-6 ~]# cat /proc/cluster/services
> >> Service          Name                              GID LID State   Code
> >> Fence Domain:    "default"                           3   2 recover 4 -
> >>                  [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
> >>                  [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "Magma"                            12   5 recover 0 -
> >>                  [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
> >>                  [10 9 8 7 2 3 6 4 1 11]
> >> GFS Mount Group: "homeneu"                          18   7 recover 0 -
> >>                  [10 9 8 7 2 3 6 4 1 11]
> >> User:            "usrm::manager"                    11   4 recover 0 -
> >>                  [1 2 6 10 9 8 3 7 4 11]
> >>
> >> [root@compute-0-10 ~]# cat /proc/cluster/dlm_stats
> >> DLM stats (HZ=1000)
> >>
> >> Lock operations:       4036
> >> Unlock operations:     2001
> >> Convert operations:    1862
> >> Completion ASTs:       7898
> >> Blocking ASTs:           52
> >>
> >> Lockqueue        num  waittime   ave
> >> WAIT_RSB        3778     28862     7
> >> WAIT_CONV         75       482     6
> >> WAIT_GRANT      2171      7235     3
> >> WAIT_UNLOCK      153      1606    10
> >> Total           6177     38185     6
> >>
> >> [root@compute-0-10 ~]# cat /proc/cluster/sm_debug
> >> sevent state 7
> >> 02000012 sevent state 9
> >> 00000003 remove node 5 count 10
> >> 01000011 remove node 5 count 10
> >> 0100000c remove node 5 count 10
> >> 01000007 remove node 5 count 10
> >> 02000012 remove node 5 count 10
> >> 0300000b remove node 5 count 10
> >> 00000003 recover state 0



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



