[Spacewalk-list] Monitoring and notifications
David Nutter
davidn at bioss.sari.ac.uk
Thu Oct 22 14:12:40 UTC 2009
On Thu, Oct 22, 2009 at 11:14:15AM +0200, Miroslav Suchý wrote:
> David Nutter wrote:
> >Any thoughts on further debugging steps? I'm rather confused about the
> >relationships between the three notifier scripts which isn't
> >helping. Any insight gratefully received :)
>
> The relationship is following:
> notif-escalator -- is in charge of tracking alerts and sends and doing
> the escalation.
> notif-launcher -- reads new alerts from inbound queue, process redirects
> and register the alerts and initial sends with the notif-escalator
> notifier -- remote program, which polls the notif-escalator for new
> sends and deliver these sends via smtp and snmp (via database and scout).
>
> So my guess is that notif-launcher is somehow broken. If I had to guess
> -- I would check if notif-launcher check for correct directory.
>
> Unfortunately I have currently no time to investigate it myself.
Thanks for the information. As a result I've made some further
investigations myself, and got it working again with a hack or
two. I'm now not sure when monitoring stopped sending notifications;
it may be that the few boxes we were monitoring didn't generate any
alerts throughout the lifecycle of Spacewalk 0.5. We're monitoring
more things now and consequently get more alerts.
Anyway, findings:
1) There appear to be bugs in NOCpulse::Notif::FileQueue->_filelist
and NOCpulse::Notif::ContactGroup->add_destination. See notes below for
details. I'm not sure if this is a problem with my environment or a
true bug in NOCpulse. If desired I can provide a patch - about 3
lines of changes in total.
Relevant packages:
NPalert-1.126.9-1.el5
perl-5.8.8-18.el5_3.1
perl-Class-MethodMaker-2.08-4.el5
2) If files in /etc/notification are owned by anyone other than the
nocpulse user and group, the generation of monitoring config
fails. Consequently, I was using an old config, which added to the
weirdness. The obvious workaround is to give nocpulse control over
that directory; this should probably be done by the
rhn-update-monitoring.pl script. Perhaps GenerateNotifConfig should
die if it can't write to the config files rather than just logging
the problem?
3) notif-launcher will try to queue a send even if there are no
send objects associated with the alert (i.e. Alert->create_initial_sends
returns nothing). This causes confusing
log messages about the Escalator interface.
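Finding 2 suggests failing fast when the generated config cannot be
written. A minimal sketch of that idea follows. The helper
write_config_or_die and the file name example.conf are made up for
illustration; this is not the GenerateNotifConfig code, and it writes to
a scratch directory rather than /etc/notification.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write a generated config file and die on any
# failure, instead of logging the problem and carrying on with stale data.
sub write_config_or_die {
    my ($path, $content) = @_;
    open(my $fh, '>', $path) or die "Cannot open $path for writing: $!\n";
    print {$fh} $content     or die "Cannot write to $path: $!\n";
    close($fh)               or die "Cannot close $path: $!\n";
    return 1;
}

# Scratch directory standing in for the real config directory.
my $dir = tempdir(CLEANUP => 1);
write_config_or_die("$dir/example.conf", "example\n");

# A path inside a directory that does not exist fails loudly.
my $ok = eval { write_config_or_die("$dir/missing/example.conf", "x\n"); 1 };
print "second write failed as expected: $@" unless $ok;
```

With this shape, an unwritable file stops the run at the point of
failure, so a stale config is less likely to go unnoticed.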
Long stream-of-consciousness debugging notes follow.
First problem: NOCpulse::Notif::FileQueue doesn't seem to work for
me. The problem appears to be this line in the _filelist method:
    @{$self->_files}=map { $self->directory . "/$_" } @files;
With the line as is, _files just ends up empty. If I hack it to:
    my @newfiles=map { $self->directory . "/$_" } @files;
    $self->{_files}=\@newfiles;
after a monitoring restart FileQueue starts to work, and
notif-launcher picks up the alerts in the queue. I get the following
in notif-launcher.log:
2009-10-22 08:25:30 Started Alert [] 01_1256199930_019314_003
2009-10-22 08:25:30 Completed Alert [] 01_1256199930_019314_003
2009-10-22 08:25:30 NOCpulse::Notif::FileQueue::peek /var/lib/notification/queue/alert_queue/01_1256199930_019314_003 no longer exists -- dequeuing
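One way the behaviour above could arise is an accessor that returns a
reference to a copy of the stored list rather than to the list itself.
The sketch below reproduces that in isolation. It is only an assumption
about the cause: Demo::Queue and its hand-written _files accessor are
hypothetical stand-ins, not the real FileQueue or the accessors that
Class::MethodMaker generates.

```perl
use strict;
use warnings;

package Demo::Queue;    # hypothetical stand-in, not NOCpulse::Notif::FileQueue

sub new {
    my $class = shift;
    return bless { _files => [] }, $class;
}

# ASSUMPTION: the accessor hands back a reference to a *copy* of the list.
sub _files {
    my $self = shift;
    return [ @{ $self->{_files} } ];
}

sub directory { return '/var/lib/notification/queue/alert_queue' }

# Original idiom: the assignment lands in the temporary copy, which is
# discarded as soon as the statement finishes.
sub fill_via_accessor {
    my ($self, @files) = @_;
    @{ $self->_files } = map { $self->directory . "/$_" } @files;
    return;
}

# Workaround from the notes above: replace the stored reference directly.
sub fill_via_slot {
    my ($self, @files) = @_;
    my @newfiles = map { $self->directory . "/$_" } @files;
    $self->{_files} = \@newfiles;
    return;
}

package main;

my $q = Demo::Queue->new;

$q->fill_via_accessor('01_1256199930_019314_003');
my $count_accessor = scalar @{ $q->_files };

$q->fill_via_slot('01_1256199930_019314_003');
my $count_slot = scalar @{ $q->_files };

print "via accessor: $count_accessor, via slot: $count_slot\n";
```

Under that assumption the first form leaves the stored list empty and
the second fills it, which matches the symptoms seen here.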
However, emails are still not sent. Looking on the bright side, at
least the alerts get dequeued so no more "Notification Meltdown"
messages. I added some more logging to notif-launcher so I could check
the return value of NOCpulse::Notif::Alert->process_redirects. When an
alert occurs I now get:
2009-10-22 09:40:40 Started Alert [] 01_1256204439_012135_001
2009-10-22 09:40:40 Process alerts returned contact group 21 not found
2009-10-22 09:40:40 Destinations now $VAR1 = [];
2009-10-22 09:40:40 Completed Alert [] 01_1256204439_012135_001
2009-10-22 09:40:40 NOCpulse::Notif::FileQueue::peek /var/lib/notification/queue/alert_queue/01_1256204439_012135_001 no longer exists -- dequeuing
Aha, so it can't find my contact group. Explains why no emails get
sent. Looking in /etc/notification I found that all the files in
/etc/notification/generated were owned by UID 502:502, not nocpulse:nocpulse
and were very out of date. I assume this is just an artefact of an
upgrade. After a chown and a restart of monitoring, an attempt is made
to pass alerts to the escalator, but this fails with log messages like:
2009-10-22 10:40:54 ERROR: Cannot create sends /var/tmp/01_1256208054_014710_001. Skipping alert. Escalator Interface error:
2009-10-22 10:40:54 NOCpulse::Notif::FileQueue::skip (skip) file is /var/lib/notification/queue/alert_queue/01_1256208054_014710_001
The notif-escalator.log has corresponding messages
2009-10-22 10:40:54 Registered Alert [00004] /var/tmp/01_1256208054_014710_001
2009-10-22 10:40:54 Registering sends for Alert [00004] 01_1256208054_014710_001 send count = 0
So though the launcher is talking to the escalator, it isn't actually
registering any sends, because create_initial_sends is returning an
empty list. This leads to the confusing "Interface error" above even
though there's no actual interface error. I think notif-launcher should check
the length of @sends before calling register_sends.
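The suggested check could look roughly like the sketch below. The
functions queue_sends and register_sends here are hypothetical
stand-ins written for this example, not the real notif-launcher or
escalator interface, and the second alert ID is invented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the escalator: records whatever it is given.
my @registered;
sub register_sends {
    my (@sends) = @_;
    push @registered, @sends;
    return scalar @sends;
}

# Only talk to the escalator when there is something to register, and
# say so plainly when there is not.
sub queue_sends {
    my ($alert_id, @sends) = @_;
    unless (@sends) {
        return "Alert $alert_id: no sends created, nothing to register";
    }
    my $n = register_sends(@sends);
    return "Alert $alert_id: registered $n send(s)";
}

my $empty_msg = queue_sends('01_1256208054_014710_001');
my $full_msg  = queue_sends('example_alert', 'send-a', 'send-b');
print "$empty_msg\n$full_msg\n";
```

The point of the guard is that an alert with no sends produces a
message describing that case, instead of an error that looks like a
communication failure.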
Email still doesn't send, though, because ContactGroup->destinations is
empty. I traced this to the following line in
NOCpulse::Notif::ContactGroup:
    push(@{$self->destinations},$destination);
It's the same idiom that causes problems in FileQueue. When I replace
it with this:
    $self->push_destinations($destination);
restart monitoring, and create an alert, I get an email at last.
Regards,
--
David Nutter Tel: +44 (0)131 650 4888
BioSS, JCMB, King's Buildings, Mayfield Rd, EH9 3JZ. Scotland, UK
Biomathematics and Statistics Scotland (BioSS) is formally part of The
Scottish Crop Research Institute (SCRI), a registered Scottish charity
No. SC006662