[Linux-cluster] GFS2 cluster node is running very slow
Steven Whitehouse
swhiteho at redhat.com
Thu Mar 31 14:24:46 UTC 2011
Hi,
On Thu, 2011-03-31 at 10:14 -0400, David Hill wrote:
> These directories are all on the same mount ... with a total size of 1.2TB!
> /mnt/gfs is the mount
> /mnt/gfs/scripts/appl01
> /mnt/gfs/scripts/appl02
> /mnt/gfs/scripts/appl03
> /mnt/gfs/scripts/appl04
> /mnt/gfs/scripts/appl05
> /mnt/gfs/scripts/appl06
> /mnt/gfs/scripts/appl07
> /mnt/gfs/scripts/appl08
>
> All files accessed by the application are within its own folder/subdirectory.
> No file is ever accessed by more than one node.
>
> I'm going to suggest splitting, but this also brings up another issue:
>
> - We have a daily GFS lockout now... We need to reboot the whole cluster to solve the issue.
>
I'm not sure what you mean by that. What actually happens? Is it just
the filesystem that goes slow? Do you get any messages
in /var/log/messages? Do any nodes get fenced, or does that fail too?
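
If it helps, something along these lines (just the standard RHEL5/CentOS
cluster tools; adjust for your setup) would show whether fencing is even
being attempted when it locks up:

  # look for fencing, withdraw and dlm messages around the time of the hang
  grep -iE 'fence|withdraw|dlm|gfs' /var/log/messages

  # current cluster membership and fence/dlm/gfs group state
  cman_tool nodes
  group_tool ls
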
Steve.
> This is going bad.
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alan Brown
> Sent: 31 March 2011 07:21
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
>
> David Hill wrote:
> > Hi Steve,
> >
> > We seem to be experiencing some new issues now... With 4 nodes, only one was slow, but with 3 nodes, 2 of them are now slow.
> > 2 nodes are doing 20 KB/s and one is doing 2 MB/s ... It seems like all nodes will end up with poor performance.
> > All nodes are locking files in their own directory: /mnt/application/tomcat-1, /mnt/application/tomcat-2 ...
>
> Just to clarify:
>
> Are these directories on the same filesystem or are they on individual
> filesystems?
>
> If the former, try splitting into separate filesystems.
>
> Remember that one node will become the filesystem master and everything
> else will be slower when accessing that filesystem.
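>
> As a rough illustration only (the cluster, volume group and logical volume
> names below are made up, and you'd obviously have to migrate the data),
> one filesystem per application would look something like:
>
>   # one logical volume and one GFS2 filesystem per application (example names)
>   mkfs.gfs2 -p lock_dlm -t mycluster:appl01 -j 4 /dev/vg_san/lv_appl01
>   mount -t gfs2 /dev/vg_san/lv_appl01 /mnt/gfs/scripts/appl01
>
> The -j 4 assumes four nodes mount it; each filesystem then gets its own
> master instead of all eight application directories contending on one.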
>
> > I'm out of ideas on this one.
> >
> > Dave
> >
> >
> >
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Hill
> > Sent: 30 March 2011 11:42
> > To: linux clustering
> > Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> >
> > Hi Steve,
> >
> > I think you're right about the glock ... There were MANY more of these.
> > We're using a new server with totally different hardware. We ran many tests
> > before posting to the mailing list, such as:
> > - copying files from the problematic node to the other nodes without using the problematic mount: everything is fine (7 MB/s)
> > - reading from the problematic mount on the "broken" node: fine too (21 MB/s)
> > So at this point I doubt the problem is the network infrastructure behind the node (or the network adapter), because everything is going smoothly in every other respect, BUT
> > writing to the /mnt on the broken node is effectively unusable. The last time I tried to copy a file to that /mnt it was doing 5 KB/s, while
> > all the other nodes are doing fine at 7 MB/s ...
> >
> > Whenever we rerun the test, it doesn't seem to go higher than 200 KB/s ...
> >
> > But still, we can transfer to all nodes at a decent speed from that host.
> > We can transfer to the SAN at a decent speed.
> >
> > CPU is 0% used.
> > Memory is 50% used.
> > Network is 0% used.
> >
> > The only difference between that host and the others is that the MySQL database is hosted locally, with its storage on the same SAN ... but even with this,
> > mysqld is using only 2 Mbit/s on the loopback, a little bit of memory and mostly NO CPU.
> >
> >
> > Here is a capture of the system:
> > top - 15:39:51 up 7:40, 1 user, load average: 0.08, 0.13, 0.11
> > Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie
> > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu1 : 0.1%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu2 : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu3 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu7 : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu8 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu9 : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu10 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu13 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu14 : 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu15 : 0.4%us, 0.1%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu16 : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu17 : 0.4%us, 0.1%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu18 : 0.2%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu19 : 0.6%us, 0.1%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu20 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu21 : 0.6%us, 0.1%sy, 0.0%ni, 99.2%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
> > Cpu22 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu23 : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
> > Mem: 32952896k total, 2453956k used, 30498940k free, 256648k buffers
> > Swap: 4095992k total, 0k used, 4095992k free, 684160k cached
> >
> >
> > It's a monster for what it does. Could it be that it's so much more powerful than the other nodes that it kills itself?
> >
> > The server is CentOS 5.5.
> > The filesystem is 98% full (31 GB remaining of 1.2 TB) ... but if that is an issue, why are all the other nodes running smoothly with no issues except that one?
> >
> >
> > Thank you for the reply,
> >
> > Dave
> >
> >
> >
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steven Whitehouse
> > Sent: 30 March 2011 07:48
> > To: linux clustering
> > Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> >
> > Hi,
> >
> > On Wed, 2011-03-30 at 01:34 -0400, David Hill wrote:
> >> Hi guys,
> >>
> >>
> >>
> >> I’ve found this in /sys/kernel/debug/gfs2/fsname/glocks
> >>
> >>
> >>
> >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
> >>
> >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
> >>
> >> H: s:EX f:W e:0 p:806 [pdflush] gfs2_write_inode+0x57/0x152 [gfs2]
> >>
> > This doesn't mean anything without a bit more context. Were these all
> > queued against the same glock? If so, which glock was it?
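> >
> > (For what it's worth, each H: holder line only makes sense together with
> > the G: line above it in the dump, so grabbing the whole file and looking at
> > the waiting holders with some context, roughly:
> >
> >   cat /sys/kernel/debug/gfs2/fsname/glocks > /tmp/glocks.dump
> >   grep -B 10 'f:tW' /tmp/glocks.dump | less
> >
> > would show which glock they were queued on.)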
> >
> >>
> >>
> >> The application running is Confluence and it has 184 threads. The other
> >> nodes work fine, but that specific node is having issues obtaining
> >> locks when it's time to write?
> >>
> > That does sound a bit strange. Are you using a different network card on
> > the slow node? Have you checked to see if there is too much traffic on
> > that network link?
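> >
> > (One quick way to check that, if sysstat is installed, is something like
> >
> >   sar -n DEV 1 | grep eth1
> >
> > on the slow node, which would show whether that interface is anywhere near
> > saturation; the interface name is only an example.)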
> >
> > Also, how full was the filesystem and which version of GFS2 are you
> > using (i.e. RHELx, Fedora X or CentOS or....)?
> >
> >
> > Steve.
> >
> >>
> >>
> >> Dave
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> From: linux-cluster-bounces at redhat.com
> >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Hill
> >> Sent: 29 March 2011 21:00
> >> To: linux-cluster at redhat.com
> >> Subject: [Linux-cluster] GFS2 cluster node is running very slow
> >>
> >>
> >>
> >>
> >> Hi guys,
> >>
> >>
> >>
> >> We have a GFS2 cluster consisting of 3 nodes. At this
> >> point, everything is running smoothly. Now we have added a new node with more
> >> CPUs but the exact same configuration, and all transactions on its mount run
> >> very slowly.
> >>
> >>
> >>
> >> Copying a file to the mount runs at about 25 KB/s, while on the three
> >> other nodes everything goes smoothly at about 7 MB/s.
> >>
> >> CPU on all nodes is mostly idle, and all the cluster processes are more or
> >> less sleeping.
> >>
> >>
> >>
> >> We've tried the ping_pong.c from apache and it seems to be able to
> >> take read/write locks on files at a decent rate.
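> >>
> >> (In case it helps to reproduce that test: if I remember the usage
> >> correctly, ping_pong is compiled with gcc and run on every node at the
> >> same time against the same file, with the lock count set to the number of
> >> nodes plus one; the file path below is just an example.)
> >>
> >>   gcc -o ping_pong ping_pong.c
> >>   # run simultaneously on each node; it reports locks per second
> >>   ./ping_pong /mnt/gfs/ping_test.dat 5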
> >>
> >>
> >>
> >> There are other mounts on the system using the same FC
> >> card/fibers/switches/SAN, and all of these are also working at a decent
> >> speed...
> >>
> >>
> >>
> >> I’ve been reading a good part of the day, and I can’t seem to find a
> >> solution.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> David C. Hill
> >>
> >> Linux System Administrator - Enterprise
> >>
> >> 514-490-2000#5655
> >>
> >> http://www.ubi.com
> >>
> >>
> >>
> >>