Server Monitoring - A replacement for Nagios?
Nigel Jones
dev at nigelj.com
Thu Jul 31 02:59:04 UTC 2008
Okay, so while this was intended to be a primary discussion point for
tomorrow's Infrastructure meeting, we had a little bit of discussion first
in #fedora-admin, and then in #fedora-meeting, regarding Zabbix, a tool
like Nagios that I began to set up for testing this week.
In summary, the discussion ended positively: we think it will do the job
quite well, and we now really need to sit down and work out whether we
want to try implementing it on a limited scale in parallel with Nagios
(to act as a comparison).
The related part of the agenda for tomorrow will now be:
* Do we want to push this into a limited trial (say 10 key-ish machines
in our infrastructure)?
* How long would such a trial last?
* What are we going to use as a metric for such a trial?
* Are there other concerns?
Personally I'd like to see this as a step forward in revamping
sysadmin-noc so we can reduce the work load on members in sysadmin-main.
Review the log below.
-- Nigel
10:43 < mmcgrath> G: so whats your take on how big the zabbix db will
get? Should we put it on db1 or on its own box?
10:43 < mmcgrath> if its on its own (probably the same one zabbix is on)
we're lowering points of failure, but we might have to re-spec noc1 and
noc2.
10:43 < G> mmcgrath: I'm not sure
10:44 < G> this is where it's great to have people like wakko666 and
jcollie who use it at DAYJOB
10:44 < mmcgrath> yeah.
10:44 < G> maybe we need a nocdb1
10:44 < mmcgrath> If it needs to be pretty quick once it gets full of
stuff, we might just want to put it on db1.
10:44 < mmcgrath> if it stays light though, we'll probably just keep it
localhost to noc1 and give noc1 more ram / disk space.
10:45 < wakko666> mmcgrath: the rate of growth for the zabbix DB
directly depends on the poll rates for all of the checks
10:45 < mmcgrath> wakko666: if you don't mind my asking.... how many
hosts do you have and how big is the db?
10:45 < G> yeah, from what I can tell also, zabbix does it's own
housekeeping to try and consolidate some of the data
10:45 < mmcgrath> and how much stuff do you monitor? pretty default
stuff? or more then the default.
10:46 < wakko666> we have 50 hosts in production, and another 100 hosts
outside that across two zabbix nodes
10:47 < mmcgrath> wakko666: is the two zabbix nodes for high
availability or was it because one zabbix node couldn't handle the traffic?
10:47 < wakko666> it's because they're in different locations
10:47 < mmcgrath> <nod>
10:47 < mmcgrath> Do you have a dedicated db? How big is the raw database?
10:48 < fchiulli_> mmcgrath: I'm assuming that part of the discussion
will be whether to have more than one zabbix monitoring host.
10:48 < wakko666> mmcgrath: we've got a dedicated mysql db for each
node. the production data is currently around 10-20 GB, the
non-production node sits at around 40-50 GB
10:48 < wakko666> the key thing to note is that zabbix keeps data in two
forms, with tunable knobs for each.
10:48 < dgilmore> wakko666: over what time period?
10:48 < mmcgrath> wakko666: you don't happen to have sar data for those
hosts you could give to me would you? :)
10:49 < wakko666> dgilmore: we're at about 3-4 months right now
10:49 < mmcgrath> I suppose we can start out small and move it later...
its not really that big of a risk.
10:49 < wakko666> mmcgrath: unfortunately, today was my last day there.
i was "reorganized" out of a job. ;-)
10:49 < dgilmore> wakko666: so your anticipating up to 80gb for
production a year?
10:49 < mmcgrath> wakko666: doah, well... hope all is well.
10:50 < wakko666> dgilmore: sort of. as i was saying, there are two
knobs. poll data, and trend data.
10:50 < wakko666> typically, we keep all polled data for about 7 days
worth, then only keep trend data after that
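
[Editor's sketch] The two retention knobs wakko666 describes (raw polled data kept for a few days, trend data kept longer) lend themselves to a back-of-the-envelope size estimate. The following is a minimal sketch, not a Zabbix formula: the function name, the bytes-per-row figures, and the one-trend-row-per-item-per-hour assumption are all illustrative placeholders, not measured values.

```python
def estimate_db_bytes(items, poll_interval_s, history_days, trend_days,
                      history_row_bytes=50, trend_row_bytes=128):
    """Rough steady-state storage estimate for polled and trend data.

    items            -- number of monitored items across all hosts
    poll_interval_s  -- average seconds between polls of one item
    history_days     -- how long raw polled values are kept
    trend_days       -- how long aggregated trend rows are kept
    *_row_bytes      -- ASSUMED average row sizes (placeholders)
    """
    polls_per_day = 86400 / poll_interval_s
    history_rows = items * polls_per_day * history_days
    # ASSUMPTION: one aggregated trend row per item per hour.
    trend_rows = items * 24 * trend_days
    return int(history_rows * history_row_bytes + trend_rows * trend_row_bytes)


if __name__ == "__main__":
    # Hypothetical trial: 10 hosts with 100 items each, polled every 60 s,
    # 7 days of raw data and one year of trends.
    size = estimate_db_bytes(items=1000, poll_interval_s=60,
                             history_days=7, trend_days=365)
    print(f"{size / 1e9:.2f} GB")
```

The point of the sketch is the shape of the relationship: the raw-data term scales inversely with the poll interval, so halving the interval doubles that term, while the trend term does not depend on the poll rate at all.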
10:50 < ricky> mmcgrath: IT's in now :-)
10:50 < dgilmore> much like cacti does
10:50 < mmcgrath> ricky: hilarious.
10:50 < wakko666> mmcgrath: yeah, i'll probably be fine. though, i
wouldn't mind finding a spot at RH. ;-)
10:51 < G> wakko666: wait a second, I thought if you setup multiple
nodes they could share the same tasks?
10:51 < mmcgrath> and, correct me if I'm wrong, but zabbix doesn't store
RRD right? the graphs come from the database?
10:51 < G> mmcgrath: correct from what I can tell
10:51 < wakko666> mmcgrath: correct. graphs are auto-generated, not
RRD. so you can create new graphs and they're autopopulated with old data
10:52 < wakko666> G: yes, nodes share the data from the tasks. the
zabbix-agent.conf and zabbix-server.conf help configure which node
performs the polling
10:52 < mmcgrath> wakko666: were you using auto-recovery services?
10:52 < G> wakko666: k, so it's one big db and you just assign hosts to
each node?
10:53 < wakko666> mmcgrath: auto-recovery? not sure what you mean.
perhaps you mean auto-discovery?
10:53 < G> wakko666: remote commands :)
10:53 < mmcgrath> wakko666: like if httpd dies on an app server, have
zabbix restart it.
10:53 < wakko666> G: can be, or you can set up a db per node, or db on
some nodes and not others. it's pretty flexible
10:54 < wakko666> mmcgrath: ah ha! yeah, you can have zabbix execute
commands on healthcheck failure
10:54 < wakko666> really, the big limitation of zabbix is a couple of things
10:54 < G> I'd like to see noc1/noc2 share the zabbix checks
10:54 < wakko666> currently, in 1.4, there's no repeated notifications.
one notify is all you get.
10:54 < G> wakko666: yeah, I noticed that
10:54 < wakko666> (it's coming in 1.6, which is due in Sept)
10:54 < mmcgrath> G: yeah, I'm totally fine re-thinking how we have our
noc's setup. The big things I want are:
10:55 < mmcgrath> paged alerts when a service is not available.
10:55 < mmcgrath> and email alerts when an individual service in a farm
goes down.
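
[Editor's sketch] mmcgrath's two requirements amount to a small escalation policy: page when the service as a whole is unavailable, email when only some members of a farm are down. A minimal sketch of that decision follows; it assumes "service not available" means every host in the farm is down, and `choose_alert` is a hypothetical helper, not part of Nagios or Zabbix.

```python
def choose_alert(farm_status):
    """Decide how to alert for one service running on a farm of hosts.

    farm_status maps host name -> True if the service on that host is up.
    Returns None if nothing is wrong, otherwise a (channel, hosts) tuple
    where channel is "page" or "email".
    """
    down = sorted(host for host, up in farm_status.items() if not up)
    if not down:
        return None
    # ASSUMPTION: the service is unavailable only when every host is down.
    if len(down) == len(farm_status):
        return ("page", down)
    return ("email", down)


if __name__ == "__main__":
    print(choose_alert({"app1": True, "app2": False, "app3": True}))
    print(choose_alert({"app1": False, "app2": False, "app3": False}))
```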
10:55 < G> mmcgrath: yeah
10:55 < wakko666> mmcgrath: yup, no troubles doing those, and you'll
likely get finer granularity than with nagios
10:55 < mmcgrath> that got kind of tricky in one nagios instance.
10:55 < G> yep, exactly
10:56 < mmcgrath> well, and even tricker in one nagios instance in PHX :)
10:56 < mmcgrath> wakko666: if there's some services that noc1 can't get
to but noc2 can, can you tell zabbix to always check those with noc2?
10:56 < G> mmcgrath: the nice thing is, is that you can run the
zabbix-server on more than one server, and the web interface on totally
different servers
10:57 < wakko666> yeah... with multiple nodes, you define checks per
node. so you'd configure a particular host on noc2's zabbix node.
10:57 < G> yeah, thats what we really want
10:57 < mmcgrath> yep.
10:57 < G> actually #fedora-meeting is free, shall we have an impromptu
there?
10:57 < wakko666> works for me.
10:58 < mmcgrath> G: sure
-- Discussion moved to #fedora-meeting --
10:58 -!- G changed the topic of #fedora-meeting to: sysadmin-noc -
System Monitoring Needs
10:58 < mmcgrath> W00t
10:58 < G> ricky: dgilmore: jcollie: you folks around?
10:58 < mmcgrath> G: so I want zabbix to monitor when new versions of my
packages are around, build them, and push them via bodhi when new
versions are out :)
10:58 * mmcgrath runs
10:59 < wakko666> lol
10:59 < G> mmcgrath: haha :)
10:59 < ricky> G: pongish
10:59 < G> okay, so if you open your hym books to
http://publictest3.fedoraproject.org/zabbix/overview.php we have a
basic-ish setup atm
11:00 < wakko666> looks like the basic Linux Server template...
11:00 < G> wakko666: yeah :)
11:00 < G> wakko666: except I started moving some of the specific checks
like apache into other templates and started linking them
11:00 < dgilmore> G: not really
11:01 < wakko666> G: that works. one suggestion: copy the default
graphs for Zabbix Server into the Linux Server template so you get some
default graphing for each host
11:01 < mmcgrath> G: any luck getting ahold of fchuili?
11:02 < G> argh, I meant to ping him back before
11:02 < G> dgilmore: no problem :)
11:02 < G> wakko666: ricky: you have accounts there now, irc nick/test
11:02 < mmcgrath> I'll drop him an email
11:03 < G> wakko666: they were in default settings iirc
11:03 < G> oh maybe not
11:03 < mmcgrath> G: you don't happen to know if we can plug this in to
FAS do you?
11:03 * dgilmore will note he tried zabbix
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaannd founnnnnnnnnnnnnnnnnnnnd
ituseleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeess hard to configure and
didnt work right
11:04 < G> wakko666: okay, done that now
11:04 * dgilmore wonders when ajax will gggget
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx
fied:)pretty please
11:04 < G> mmcgrath: I don't think so, kinda like cacti in a way
11:04 < mmcgrath> dgilmore: thats so funny, G set this up in a matter of
hours and have no problems at all ;-)
11:05 < wakko666> i'm not sure about plugging the auth into FAS, but
it's PHP so at the very least it should be hackable
11:05 < dgilmore> mmcgrath: monitoring localhost worked
11:05 < dgilmore> mmcgrath: but that was it
11:05 < G>
http://publictest3.fedoraproject.org/zabbix/charts.php?period=86400&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10017&graphid=5
11:05 < ricky> Worst case, we put it behind basic auth.
11:05 < mmcgrath> dgilmore: time for another look :)
11:05 < G> I like the stuff like that
11:05 < mmcgrath> ricky: yeah, thats what I was thinking
11:05 < dgilmore> mmcgrath: it was about 3 or 4 months ago i think
11:05 < G> dgilmore: I got 4 hosts monitored in no time, only trouble
was iptables on the pt machines :)
11:06 < wakko666> one note: stacked graphs occasionally don't render
quite right. sometimes zabbix leaves white space between data sets
11:06 < G> wakko666: yeah, but it still shows the trend quite nicely
11:06 < wakko666> G: agreed.
11:07 < wakko666> for the web servers, setting up app-specific web
checks is great, and fairly easy to do
11:07 < G> hmmm whats on PT7, it takes a bit of beating
11:08 < G>
http://publictest3.fedoraproject.org/zabbix/charts.php?period=43200&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10026&graphid=31
11:08 < G> okay, so I think the first thing is:
11:08 < G> What are our requirements?
11:08 < ricky> pt7 looks fine to me
11:08 < G> I can easily add the following:
11:08 < ricky> Ah, it's back in the green now.
11:09 < G> -> Equal checking abilities to nagios (i.e. the type of checks)
11:09 < ricky> Could you walk us through the processes of adding a
complex check?
11:09 < G> -> Ability to send out e-mails/pagers
11:09 < ricky> And also, is there any sort of equivalent screen to
https://admin.fedoraproject.org/nagios/cgi-bin//status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
in nagios?
11:09 < mmcgrath> and what is the difference between a "web" check and
just your normal check?
11:09 < mmcgrath> how do we write custom plugins?
11:09 < mmcgrath> why is the sky blue?
11:09 < G> -> Ability to customise stuff
11:10 < ricky> (As little information as possible - a view with *just*
what problems are going on)
11:10 < G> ricky: yes
11:10 < wakko666> mmcgrath: by 'web check' i mean, a semi-intelligent
check of a web-app, where you can set up a series of steps for it to
check through such as "hit koji.fp.org, click packages, click builds, etc"
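
[Editor's sketch] The "series of steps" idea behind a web check can be illustrated as an ordered list of page fetches, each with a string that must appear in the response, stopping at the first failure. This is a generic sketch of the technique, not Zabbix's implementation; the step names, URLs, and expected strings are hypothetical.

```python
import urllib.request


def http_fetch(url):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_scenario(steps, fetch=http_fetch):
    """Run steps in order; return the name of the first failing step.

    steps is a list of (name, url, expected_text) tuples. A step fails if
    the fetch raises, the status is not 200, or expected_text is missing
    from the body. Returns None when every step passes.
    """
    for name, url, expected in steps:
        try:
            status, body = fetch(url)
        except Exception:
            return name
        if status != 200 or expected not in body:
            return name
    return None


# Hypothetical scenario; URLs and expected strings are placeholders.
KOJI_STEPS = [
    ("front page", "http://koji.example.org/koji/", "Welcome"),
    ("packages", "http://koji.example.org/koji/packages", "Packages"),
    ("builds", "http://koji.example.org/koji/builds", "Builds"),
]
```

Passing `fetch` as a parameter keeps the scenario logic testable without network access: a fake fetcher can stand in for the real one.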
11:10 < G>
http://publictest3.fedoraproject.org/zabbix/tr_status.php?onlytrue=true&noactions=false&compact=false&select=false&txt_select=&sort=priority
11:10 < dgilmore> can i just edit a nice easy to read config file to do
things?
11:11 < G> dgilmore: and break it while you try to work out why it broke
11:11 < wakko666> dgilmore: all config is done through the zabbix web gui
11:11 < G> errr
11:11 < G> and break it and spend ages working out why you broke it
11:11 < wakko666> ricky: the zabbix equivalent to that nagios screen is
the Monitoring -> Overview screen, though under Screens, you can set up a
customized view as well.
11:12 < dgilmore> wakko666: to me thats really bad
11:12 < G> wakko666: the triggers = true page is like that too
11:12 < G> (the link I pasted just before)
11:12 * dgilmore personally doesnt like configuring though a web gui.
maybe why zabbix did not work out for me
11:12 < wakko666> dgilmore: it's a different paradigm. i don't equate
different to bad. not having config files doesn't strike me as a flaw.
11:13 < abadger1999> wakko666: makes it harder to manage it via puppet.
11:14 < wakko666> abadger1999: yes and no. there are config files for
the polling server daemon and the client-side agent. at $dayjob, i
push the agent configs via puppet
11:14 < abadger1999> <nod.
11:15 < ricky> I guess abadger1999 was referring to things like
configuration for specific checks and things like that
11:15 < G> wakko666: custom checks are defined in the agent config right?
11:15 < wakko666> to me, the big thing with zabbix is that it's
essential to back up the db, and export your configs on a regular
basis. it's painful to spend hours setting zabbix up, and have your db
get corrupted and have to do all that work all over again
11:16 < ricky> So the exciting question: What problems that we're seeing
with nagios does zabbix solve?
11:16 < wakko666> G: custom checks can be one of two things. custom
zabbix-agent checks, and zabbix server-side remote checks
11:16 < G> wakko666: oh thats extra nifyt
11:16 < ricky> One thing is combining cacti functionality - what else?
11:16 < G> *nifty
11:16 < G> ricky: distributed monitoring :)
11:17 < ricky> Can you elaborate a bit? :-)
11:17 < G> and has Brett pointed out before, complex checks
11:17 < wakko666> for me, zabbix does templates and rapid configuration
of new hosts significantly better than nagios
11:17 < G> errr complex web checks
11:17 < G> yeah, the templating looks _REALLY_ good
11:18 < wakko666> zabbix also is more granular than both cacti and
nagios. the default network traffic checks are done every 5 seconds
11:18 < mmcgrath> G: I take it it has similar workflow that nagios has?
(not that we used it?)
11:18 < G> build a profile of the typical application server apply the
template to all the app servers and your home free
11:18 < ricky> Do you have a link where I can see the templating
coolness in action?
11:18 < mmcgrath> but outage happens, someone ack's it and starts working?
11:18 < G> mmcgrath: ack etc? yeah
11:18 < f13> darn, I have to leave, but I'm really interested in what
platform wins out. Particularly interested in zenoss vs zabbix
11:18 < ricky> Because right now, I'm visualizing hostgroups in nagios
11:18 < wakko666> mmcgrath: yes. same basic workflow
11:19 < wakko666> f13: i vote zabbix over zenoss simply because zabbix
doesn't use rpath
11:19 < ricky> f13: zenoss = zope :-(
11:19 < mmcgrath> wakko666: G: how hard is it to script outages?
11:19 < G> that'd be something brett would have to answer
11:20 < f13> wakko666: there is that.
11:20 < f13> ricky: good point.
11:20 < f13> zenoss had something going for it in that previous
cacti/nagios stuff would work with it, or so was the claim
11:20 < wakko666> outages are the one thing about zabbix that is a bit
unclear to me. i think the best analogue is to disable monitoring (a
single drop-down box), or to acknowledge the alert
11:21 * ricky still hasn't figured out where he can see templates
11:21 < wakko666> being that zabbix doesn't do repeated alerts, you'll
only get a single "down" page anyway...
11:21 < mmcgrath> wakko666: as in its difficult to schedule an outage
ahead of time?
11:21 < G> ricky:
http://publictest3.fedoraproject.org/zabbix/hosts.php?groupid=0&config=2
11:21 < wakko666> ricky: Configuration > Items or Triggers. there's a
Template drop-down
11:21 < ricky> Aha
11:22 < wakko666> mmcgrath: yeah, basically. as far as i've seen,
zabbix doesn't yet have the concept of scheduled outages. a service is
either up or down, and not much beyond that
11:23 < wakko666> i suspect that may be on their todo list for the next
version, though
11:23 < ricky> So where can I see the linkage between a template and the
checks for that template?
11:23 < G> I don't think it's an exact issue
11:23 < G> ricky: Items
11:23 < jcollie> you could always shut down the zabbix server :)
11:24 < wakko666> ricky: the expression column will have the template
name in it
11:24 < mmcgrath> I've only looked a little bit but... how well does
service deps work?
11:24 < wakko666> ricky: err... not expression column... the name column.
11:24 < ricky> I think I got it
11:24 < wakko666> mmcgrath: dependencies are dead easy.
11:24 < G> mmcgrath: it'd appear you can add multiple dependences per
trigger
11:25 < G>
http://publictest3.fedoraproject.org/zabbix/triggers.php?form=update&triggerid=10043&hostid=10001
11:25 < wakko666> if you check apache on host A, but that check goes
through router B, you add a dependency on the apache check so that the
check doesn't execute unless the checks for router B are passing.
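
[Editor's sketch] The dependency rule in wakko666's apache/router example can be modelled as alert suppression: a failing check is reported only if nothing it depends on, directly or transitively, is also failing. The log does not settle whether Zabbix skips the dependent check itself or only suppresses its alert, so this sketch assumes the latter; the function and check names are hypothetical.

```python
def effective_alerts(failing, depends_on):
    """Return the failing checks that should actually raise an alert.

    failing    -- set of check names currently failing
    depends_on -- dict mapping a check name to the checks it depends on
    A check is suppressed when any direct or transitive dependency is
    failing, so only the root cause is reported.
    """
    def blocked(check, seen):
        for dep in depends_on.get(check, ()):
            if dep in seen:  # guard against dependency cycles
                continue
            seen.add(dep)
            if dep in failing or blocked(dep, seen):
                return True
        return False

    return {check for check in failing if not blocked(check, {check})}


if __name__ == "__main__":
    deps = {"apache on host A": ["router B"]}
    # Router outage: only the router alert is raised.
    print(effective_alerts({"apache on host A", "router B"}, deps))
    # Router healthy: the apache failure is reported on its own.
    print(effective_alerts({"apache on host A"}, deps))
```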
11:28 < mmcgrath> So really
11:29 < mmcgrath> G: how about this... We give it a quick talk tomorrow
at the meeting there. If there's no blockers or major opposition. We
get it on noc1 and get to work?
11:29 < G> mmcgrath: so your happy with what I've done on pt3 so far?
11:30 < mmcgrath> Yeah so far. I'd like to see it monitoring a couple
of things along side nagios, both sending notifications, and see how it
does in production.
11:30 < mmcgrath> so not spending a ton of time on it, but monitoring a
few critical bits that frequently have problems.
11:30 < G> in that case sure, except if we are putting into production,
I guess we should grab Jeff's 0.4.6 update and put it in f-i until it
appears in epel
11:31 < G> I'll be happy to lead that task
11:31 < G> wakko666: jcollie: you both in sysadmin-noc?
11:31 < mmcgrath> G: excellent.
11:31 < wakko666> G: applying now. :)
11:31 < G> I'll sponsor you :)
11:32 < wakko666> yay! :-)
11:32 < G> mmcgrath: I think we'll leave the internal authentication for
now, I'll leave the main part readable by everyone, and add accounts for
everyone in sysadmin-main/noc thats active
11:33 < G> wakko666: done
11:33 < mmcgrath> G: thats fine.
11:34 < G> okay, so adjourned until the inframeeting 2000UTC tomorrow :)
11:34 < ricky> How can I trigger a check?
11:34 < wakko666> ricky: turn off the service that it's checking. ;-)
11:34 < ricky> Oh.
11:35 < wakko666> you can also just flip the logic of the trigger.
11:35 -!- G changed the topic of #fedora-meeting to: Channel is used by
various Fedora groups and committees for their regular meetings | Note
that meetings often get logged | For questions about using Fedora please
ask in #fedora | See
http://fedoraproject.org/wiki/Communicate/FedoraMeetingChannel for
meeting schedule
11:36 < G> I'll post a log to the infra-list soon so people can have a
read before the main meeting
11:37 < mmcgrath> G: good ide
11:37 < mmcgrath> a