[Spacewalk-list] Monitoring and notifications

David Nutter davidn at bioss.sari.ac.uk
Thu Oct 22 14:12:40 UTC 2009


On Thu, Oct 22, 2009 at 11:14:15AM +0200, Miroslav Suchý wrote:
> David Nutter wrote:
> >Any thoughts on further debugging steps? I'm rather confused about the
> >relationships between the three notifier scripts which isn't
> >helping. Any insight gratefully received :)
> 
> The relationship is as follows:
> notif-escalator -- is in charge of tracking alerts and sends and doing 
> the escalation.
> notif-launcher -- reads new alerts from the inbound queue, processes redirects, 
> and registers the alerts and initial sends with the notif-escalator.
> notifier -- remote program, which polls the notif-escalator for new 
> sends and delivers these sends via smtp and snmp (via database and scout).
> 
> So my guess is that notif-launcher is somehow broken. If I had to guess 
> -- I would check whether notif-launcher checks the correct directory.
> 
> Unfortunately I currently have no time to investigate it myself.

Thanks for the information. As a result I've made some further
investigations myself, and got it working again with a hack or
two. I'm now not sure when monitoring stopped sending notifications;
it may be that the few boxes we were monitoring didn't generate any
alerts throughout the lifecycle of Spacewalk 0.5. We're monitoring
more things now and consequently get more alerts.

Anyway, findings:

1) There appear to be bugs in NOCpulse::Notif::FileQueue->_filelist
   and NOCpulse::Notif::ContactGroup->add_destination. See notes below for
   details. I'm not sure if this is a problem with my environment or a
   true bug in NOCpulse. If desired I can provide a patch - about 3
   lines of changes in total. 

   Relevant packages:

   NPalert-1.126.9-1.el5
   perl-5.8.8-18.el5_3.1
   perl-Class-MethodMaker-2.08-4.el5

2) If files in /etc/notification are owned by anyone other than the
   nocpulse user and group, the generation of monitoring config
   fails. Consequently, I was using an old config, which was adding to the
   weirdness. The workaround is obviously to give nocpulse control over
   that directory; this should probably be done by the
   rhn-update-monitoring.pl script. Perhaps GenerateNotifConfig should
   die if it can't write to the config files rather than just logging
   the problem? 

3) notif-launcher will try to queue a send even if there are no
   send objects associated with the alert (i.e. Alert->create_initial_sends
   returns nothing). This causes confusing
   log messages about the Escalator interface.
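
For finding 2, a fail-fast check could look roughly like the sketch
below. It is only an illustration: unwritable_files is a hypothetical
helper, example.ini is a made-up file name, and the demonstration runs
against a scratch directory rather than /etc/notification. How
GenerateNotifConfig actually writes its files is not shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the entries of $dir that the effective
# user cannot write to.
sub unwritable_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!\n";
    my @paths = map { "$dir/$_" }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return grep { !-w $_ } @paths;
}

# Demonstration on a scratch directory (assumption: stands in for the
# generated config directory).
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/example.ini") or die "Cannot create file: $!\n";
close($fh);

# Die instead of merely logging when something is not writable.
my @bad = unwritable_files($dir);
die "Config files not writable: @bad\n" if @bad;
print "all config files writable\n";
```

Dying here would make a stale config obvious at generation time
instead of surfacing later as missing contact groups.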




Long stream-of-consciousness debugging notes follow. 

First problem: NOCpulse::Notif::FileQueue doesn't seem to work for
me. The problem appears to be this line in the _filelist method:

  @{$self->_files}=map { $self->directory . "/$_" } @files;

With the line as is, _files just ends up empty. If I hack it to:

  my @newfiles=map { $self->directory . "/$_" } @files;
  $self->{_files}=\@newfiles;

after a monitoring restart FileQueue starts to work, and
notif-launcher picks up the alerts in the queue. I get the following
in notif-launcher.log:

2009-10-22 08:25:30 Started Alert [] 01_1256199930_019314_003 
2009-10-22 08:25:30 Completed Alert [] 01_1256199930_019314_003 
2009-10-22 08:25:30 NOCpulse::Notif::FileQueue::peek /var/lib/notification/queue/alert_queue/01_1256199930_019314_003 no longer exists -- dequeuing
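
One way this idiom can fail silently is sketched below. It assumes the
accessor hands back an array reference that is not the one stored in
the object when the slot is unset; that is an assumption, not something
verified against NPalert or Class::MethodMaker. Demo::Queue is a
hypothetical stand-in, not the real FileQueue class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a class with a generated accessor.
package Demo::Queue;

sub new { return bless { _files => undef }, shift }

# Assumed behaviour: when the slot is unset, return a fresh anonymous
# array that is never stored in the object.
sub _files {
    my $self = shift;
    return defined $self->{_files} ? $self->{_files} : [];
}

package main;

my @names = ('a', 'b');

# Idiom from _filelist: the assignment lands in the temporary array,
# which is discarded straight away.
my $broken = Demo::Queue->new;
@{ $broken->_files } = map { "/queue/$_" } @names;
my $broken_count = scalar @{ $broken->_files };

# Workaround: store a reference to the new list in the object itself.
my $fixed = Demo::Queue->new;
my @newfiles = map { "/queue/$_" } @names;
$fixed->{_files} = \@newfiles;
my $fixed_count = scalar @{ $fixed->_files };

print "broken: $broken_count, fixed: $fixed_count\n";
```

Under that assumption the first form leaves the object's list empty
while the second populates it, which matches the symptom seen here.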


However, emails are still not sent. Looking on the bright side, at
least the alerts get dequeued so no more "Notification Meltdown"
messages. I added some more logging to notif-launcher so I could check
the return value of NOCpulse::Notif::Alert->process_redirects. When an
alert occurs I now get:

2009-10-22 09:40:40 Started Alert [] 01_1256204439_012135_001 
2009-10-22 09:40:40 Process alerts returned contact group 21 not found
2009-10-22 09:40:40 Destinations now $VAR1 = [];
2009-10-22 09:40:40 Completed Alert [] 01_1256204439_012135_001 
2009-10-22 09:40:40 NOCpulse::Notif::FileQueue::peek /var/lib/notification/queue/alert_queue/01_1256204439_012135_001 no longer exists -- dequeuing

Aha, so it can't find my contact group. Explains why no emails get
sent. Looking in /etc/notification I found that all the files in
/etc/notification/generated were owned by UID 502:502, not nocpulse:nocpulse
and were very out of date. I assume this is just an artefact of an
upgrade. After a chown and a restart of monitoring, an attempt is made
to pass alerts to the escalator, but this fails with log messages like

2009-10-22 10:40:54 ERROR: Cannot create sends /var/tmp/01_1256208054_014710_001.  Skipping alert.  Escalator
Interface error: 2009-10-22 10:40:54 NOCpulse::Notif::FileQueue::skip (skip) file is /var/lib/notification/queue/alert_queue/01_1256208054_014710_001

The notif-escalator.log has corresponding messages

2009-10-22 10:40:54 Registered Alert [00004] /var/tmp/01_1256208054_014710_001
2009-10-22 10:40:54 Registering sends for Alert [00004] 01_1256208054_014710_001 send count = 0

So though the launcher is talking to the escalator, it isn't actually
registering any sends, because create_initial_sends is returning an
empty list. This leads to the confusing "Interface error" above even
though there's no actual interface error. I think notif-launcher should check
the length of @sends before calling register_sends.
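
A minimal sketch of such a guard follows. The subs
create_initial_sends and register_sends here are hypothetical stubs
that only borrow the names used above, and launch and the in-memory
@log are invented for the illustration; the real notif-launcher code
is structured differently.

```perl
use strict;
use warnings;

my @log;

# Hypothetical stubs standing in for the real Alert and Escalator calls.
sub create_initial_sends {
    my ($alert) = @_;
    return @{ $alert->{sends} };
}

sub register_sends {
    my ($alert, @sends) = @_;
    push @log, "registered " . scalar(@sends) . " sends for $alert->{id}";
    return 1;
}

# Only talk to the escalator when there is something to register.
sub launch {
    my ($alert) = @_;
    my @sends = create_initial_sends($alert);
    unless (@sends) {
        push @log, "no sends for $alert->{id}; nothing to register";
        return 0;
    }
    return register_sends($alert, @sends);
}

launch({ id => 'alert-1', sends => [] });
launch({ id => 'alert-2', sends => ['email'] });
print "$_\n" for @log;
```

With a check like this, an alert with no destinations would produce
an explicit "no sends" message rather than an apparent interface error.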

Email still doesn't send though because ContactGroup->destinations is
empty. I traced this to the following line in
NOCpulse::Notif::ContactGroup:

  push(@{$self->destinations},$destination);

It's the same idiom that causes problems in FileQueue. When I replace
it with this:

  $self->push_destinations($destination);

restart monitoring, and create an alert, I get an email at last. 

Regards,

-- 
David Nutter  				Tel: +44 (0)131 650 4888
BioSS, JCMB, King's Buildings, Mayfield Rd, EH9 3JZ. Scotland, UK 

Biomathematics and Statistics Scotland (BioSS) is formally part of The
Scottish Crop Research Institute (SCRI), a registered Scottish charity
No. SC006662



