[Linux-cluster] Tibco EMS on GFS2

Mon Feb 13 14:15:46 UTC 2012

Hi,

On Mon, 2012-02-13 at 15:06 +0100, Laszlo Beres wrote:
> Hi Steven,
> 
> On Mon, Feb 13, 2012 at 10:38 AM, Steven Whitehouse <swhiteho at redhat.com> wrote:
> 
> > I'd be very interested to know what has not worked for you. Please open
> > a ticket with our support team if you are a Red Hat customer. We are
> > keen to ensure that people don't run into such issues, but we'll need a
> > bit more information in order to investigate,
> 
> Well, we're still in the investigation phase, so as long as I don't
> have direct evidence in hand I don't want to bother Red Hat
> support. Let me briefly summarize our case.
> 
It is still worth talking to our support team, since they may well be
able to suggest things to look into, or may have solved a similar
problem. They are there to assist even if you don't actually have a bug
as such to report.

> I set up a two-node cluster, both nodes on RHEL 5.7 x86_64 on HP
> DL380 G7 (96 GB RAM, 4 Intel Xeon X5675 / 6 cores) with EMC VMAX
> storage (on dm-multipath). We launched our Tibco service in December
> and it has been working so far without any issues. A few days ago our
> customers reported that the service is "slow". As a system engineer I
> could not really do anything with such a statement, so our application
> guys deployed a small tool which sends messages to the message bus and
> also measures the message delivery delay. Most of the time it's
> around 1-2 ms, but we were surprised to find some very high values,
> which affect dependent applications:
> 
> 2012-02-08 22:12:31,919 INFO  [Main] Message sent in: 1ms
> 2012-02-08 22:12:33,974 ERROR [Main] Message sent in: 1053ms
> 2012-02-08 22:12:34,978 INFO  [Main] Message sent in: 2ms
> 2012-02-08 22:12:35,980 INFO  [Main] Message sent in: 1ms
> 
> At the same time iostat shows high utilization:
> 
> 2012-02-08 22:12:33 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> 2012-02-08 22:12:33            0,12    0,00    0,37    4,00    0,00   95,51
> 2012-02-08 22:12:33
> 2012-02-08 22:12:33 Device:    rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await    svctm    %util
> 2012-02-08 22:12:33 sdb          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sde          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sdh          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sdk          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sdn          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sdq          0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00     0,00
> 2012-02-08 22:12:33 sdt          0,00     0,00    56,00     0,00     0,34     0,00    12,43     0,12     2,21     2,20    12,30
> 2012-02-08 22:12:33 sdw          2,00    39,00   337,00    28,00     1,84     0,12    10,98     2,40     2,75     2,34    85,40
> 2012-02-08 22:12:33 dm-5         0,00     0,00   394,00   201,00     2,18     0,79    10,19     2,98     1,94     1,65    97,90
> 
> From our service's point of view these are off-peak times; fewer
> messages were directed to EMS.
> 
> All unimportant services are disabled, and no jobs are scheduled on the system.
> 
> Also the occurrence pattern is quite strange:
> 
> 2012-02-10 03:01:24,333 ERROR [Main] Message sent in: 2306ms
> 2012-02-10 03:02:04,953 ERROR [Main] Message sent in: 1221ms
> 2012-02-10 03:11:29,725 ERROR [Main] Message sent in: 1096ms
> 2012-02-10 03:31:35,195 ERROR [Main] Message sent in: 1051ms
> 2012-02-10 03:36:36,943 ERROR [Main] Message sent in: 1263ms
> 2012-02-10 04:01:17,059 ERROR [Main] Message sent in: 1585ms
> 2012-02-10 04:01:41,953 ERROR [Main] Message sent in: 1790ms
> 2012-02-10 04:02:42,953 ERROR [Main] Message sent in: 1307ms
> 2012-02-10 04:06:03,181 ERROR [Main] Message sent in: 2305ms
> 2012-02-10 04:08:48,844 ERROR [Main] Message sent in: 1294ms
> 2012-02-10 04:12:52,282 ERROR [Main] Message sent in: 1350ms
> 2012-02-10 05:01:00,411 ERROR [Main] Message sent in: 1143ms
> 2012-02-10 06:01:01,550 ERROR [Main] Message sent in: 1291ms
> 2012-02-10 06:01:48,957 ERROR [Main] Message sent in: 1092ms
> 
> 
Do you have any backup scripts running and/or any other cron jobs which
might touch the GFS2 filesystem at certain times? That is usually the
first thing to look into,

Steve.
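
One quick way to test the cron/backup idea against the delay log quoted
above is to bucket the slow deliveries by minute of the hour. The sketch
below is only an illustration: the function name slow_minutes, the
1000 ms threshold and the regular expression are assumptions made for
this example, not part of the measuring tool or of EMS. The sample data
is the list of ERROR lines quoted above.

```python
import re
from collections import Counter

# Hypothetical pattern for the tool's log lines, e.g.
# 2012-02-10 03:01:24,333 ERROR [Main] Message sent in: 2306ms
# Group 1 is the minute of the hour, group 2 the delay in milliseconds.
LINE_RE = re.compile(
    r"\d{4}-\d{2}-\d{2} \d{2}:(\d{2}):\d{2},\d{3}\s+\w+\s+"
    r"\[Main\] Message sent in: (\d+)ms"
)

# The ERROR lines quoted earlier in this thread.
SAMPLE = """\
2012-02-10 03:01:24,333 ERROR [Main] Message sent in: 2306ms
2012-02-10 03:02:04,953 ERROR [Main] Message sent in: 1221ms
2012-02-10 03:11:29,725 ERROR [Main] Message sent in: 1096ms
2012-02-10 03:31:35,195 ERROR [Main] Message sent in: 1051ms
2012-02-10 03:36:36,943 ERROR [Main] Message sent in: 1263ms
2012-02-10 04:01:17,059 ERROR [Main] Message sent in: 1585ms
2012-02-10 04:01:41,953 ERROR [Main] Message sent in: 1790ms
2012-02-10 04:02:42,953 ERROR [Main] Message sent in: 1307ms
2012-02-10 04:06:03,181 ERROR [Main] Message sent in: 2305ms
2012-02-10 04:08:48,844 ERROR [Main] Message sent in: 1294ms
2012-02-10 04:12:52,282 ERROR [Main] Message sent in: 1350ms
2012-02-10 05:01:00,411 ERROR [Main] Message sent in: 1143ms
2012-02-10 06:01:01,550 ERROR [Main] Message sent in: 1291ms
2012-02-10 06:01:48,957 ERROR [Main] Message sent in: 1092ms
"""


def slow_minutes(lines, threshold_ms=1000):
    """Count deliveries at or above threshold_ms per minute of the hour."""
    counts = Counter()
    for line in lines:
        # search() rather than match() so "> "-quoted lines also work.
        m = LINE_RE.search(line)
        if m and int(m.group(2)) >= threshold_ms:
            counts[int(m.group(1))] += 1
    return counts


if __name__ == "__main__":
    for minute, n in slow_minutes(SAMPLE.splitlines()).most_common():
        print(f"minute :{minute:02d}  slow deliveries: {n}")
```

On the quoted sample, 6 of the 14 slow deliveries fall in minute :01 of
the hour and 2 more in minute :02, which is the sort of clustering a
periodic job could produce. That is only a hint, not evidence; a longer
log would be needed to tell.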