Server Monitoring - A replacement for Nagios?

Okay, so while this was intended to be a primary discussion point for tomorrow's Infrastructure meeting, we had a little bit of discussion first in #fedora-admin, and then in #fedora-meeting, regarding Zabbix, a tool like Nagios that I began setting up for testing this week.

In summary, the discussion ended positively: we think it will do the job quite well, and we now need to sit down and work out whether we want to try implementing it on a limited scale in parallel with Nagios (to act as a comparison).

The related part of the agenda for tomorrow will now be:
* Do we want to push this into a limited trial (say 10 key-ish machines in our infrastructure)?
* How long would such a trial last?
* What are we going to use as a metric for such a trial?
* Are there other concerns?

Personally, I'd like to see this as a step forward in revamping sysadmin-noc so we can reduce the workload on members in sysadmin-main.

Review the log below.

-- Nigel

10:43 < mmcgrath> G: so whats your take on how big the zabbix db will get? Should we put it on db1 or on its own box?
10:43 < mmcgrath> if its on its own (probably the same one zabbix is on) we're lowering points of failure, but we might have to re-spec noc1 and noc2.
10:43 < G> mmcgrath: I'm not sure
10:44 < G> this is where it's great to have people like wakko666 and jcollie who use it at DAYJOB
10:44 < mmcgrath> yeah.
10:44 < G> maybe we need a nocdb1
10:44 < mmcgrath> If it needs to be pretty quick once it gets full of stuff, we might just want to put it on db1.
10:44 < mmcgrath> if it stays light though, we'll probably just keep it localhost to noc1 and give noc1 more ram / disk space.
10:45 < wakko666> mmcgrath: the rate of growth for the zabbix DB directly depends on the poll rates for all of the checks
10:45 < mmcgrath> wakko666: if you don't mind my asking.... how many hosts do you have and how big is the db?
10:45 < G> yeah, from what I can tell also, zabbix does it's own housekeeping to try and consolidate some of the data
10:45 < mmcgrath> and how much stuff do you monitor? pretty default stuff? or more then the default.
10:46 < wakko666> we have 50 hosts in production, and another 100 hosts outside that across two zabbix nodes
10:47 < mmcgrath> wakko666: is the two zabbix nodes for high availability or was it because one zabbix node couldn't handle the traffic?
10:47 < wakko666> it's because they're in different locations
10:47 < mmcgrath> <nod>
10:47 < mmcgrath> Do you have a dedicated db?  How big is the raw database?
10:48 < fchiulli_> mmcgrath: I'm assuming that part of the discussion will be whether to have more than one zabbix monitoring host.
10:48 < wakko666> mmcgrath: we've got a dedicated mysql db for each node. the production data is currently around 10-20 GB, the non-production node sits at around 40-50 GB
10:48 < wakko666> the key thing to note is that zabbix keeps data in two forms, with tunable knobs for each.
10:48 < dgilmore> wakko666: over what time period?
10:48 < mmcgrath> wakko666: you don't happen to have sar data for those hosts you could give to me would you? :)
10:49 < wakko666> dgilmore: we're at about 3-4 months right now
10:49 < mmcgrath> I suppose we can start out small and move it later... its not really that big of a risk.
10:49 < wakko666> mmcgrath: unfortunately, today was my last day there. i was "reorganized" out of a job. ;-)
10:49 < dgilmore> wakko666: so your anticipating up to 80gb for production a year?
10:49 < mmcgrath> wakko666: doah, well... hope all is well.
10:50 < wakko666> dgilmore: sort of. as i was saying, there are two knobs. poll data, and trend data.
10:50 < wakko666> typically, we keep all polled data for about 7 days worth, then only keep trend data after that
10:50 < ricky> mmcgrath: IT's in now :-)
10:50 < dgilmore> much like cacti does
10:50 < mmcgrath> ricky: hilarious.
10:50 < wakko666> mmcgrath: yeah, i'll probably be fine. though, i wouldn't mind finding a spot at RH. ;-)
10:51 < G> wakko666: wait a second, I thought if you setup multiple nodes they could share the same tasks?
10:51 < mmcgrath> and, correct me if I'm wrong, but zabbix doesn't store RRD right? the graphs come from the database?
10:51 < G> mmcgrath: correct from what I can tell
10:51 < wakko666> mmcgrath: correct. graphs are auto-generated, not RRD. so you can create new graphs and they're autopopulated with old data
10:52 < wakko666> G: yes, nodes share the data from the tasks. the zabbix-agent.conf and zabbix-server.conf help configure which node performs the polling
10:52 < mmcgrath> wakko666: were you using auto-recovery services?
10:52 < G> wakko666: k, so it's one big db and you just assign hosts to each node?
10:53 < wakko666> mmcgrath: auto-recovery? not sure what you mean. perhaps you mean auto-discovery?
10:53 < G> wakko666: remote commands :)
10:53 < mmcgrath> wakko666: like if httpd dies on an app server, have zabbix restart it.
10:53 < wakko666> G: can be, or you can set up a db per node, or db on some nodes and not others. it's pretty flexible
10:54 < wakko666> mmcgrath: ah ha! yeah, you can have zabbix execute commands on healthcheck failure
10:54 < wakko666> really, the big limitation of zabbix is a couple of things
10:54 < G> I'd like to see noc1/noc2 share the zabbix checks
10:54 < wakko666> currently, in 1.4, there's no repeated notifications. one notify is all you get.
10:54 < G> wakko666: yeah, I noticed that
10:54 < wakko666> (it's coming in 1.6, which is due in Sept)
10:54 < mmcgrath> G: yeah, I'm totally fine re-thinking how we have our noc's setup. The big things I want are:
10:55 < mmcgrath> paged alerts when a service is not available.
10:55 < mmcgrath> and email alerts when an individual service in a farm goes down.
10:55 < G> mmcgrath: yeah
10:55 < wakko666> mmcgrath: yup, no troubles doing those, and you'll likely get finer granularity than with nagios
10:55 < mmcgrath> that got kind of tricky in one nagios instance.
10:55 < G> yep, exactly
10:56 < mmcgrath> well, and even tricker in one nagios instance in PHX :)
10:56 < mmcgrath> wakko666: if there's some services that noc1 can't get to but noc2 can, can you tell zabbix to always check those with noc2?
10:56 < G> mmcgrath: the nice thing is, is that you can run the zabbix-server on more than one server, and the web interface on totally different servers
10:57 < wakko666> yeah... with multiple nodes, you define checks per node. so you'd configure a particular host on noc2's zabbix node.
10:57 < G> yeah, thats what we really want
10:57 < mmcgrath> yep.
10:57 < G> actually #fedora-meeting is free, shall we have an impromptu there?
10:57 < wakko666> works for me.
10:58 < mmcgrath> G: sure

-- Discussion moved to #fedora-meeting --

10:58 -!- G changed the topic of #fedora-meeting to: sysadmin-noc - System Monitoring Needs
10:58 < mmcgrath> W00t
10:58 < G> ricky: dgilmore: jcollie: you folks around?
10:58 < mmcgrath> G: so I want zabbix to monitor when new versions of my packages are around, build them, and push them via bodhi when new versions are out :)
10:58  * mmcgrath runs
10:59 < wakko666> lol
10:59 < G> mmcgrath: haha :)
10:59 < ricky> G: pongish
10:59 < G> okay, so if you open your hym books to http://publictest3.fedoraproject.org/zabbix/overview.php we have a basic-ish setup atm
11:00 < wakko666> looks like the basic Linux Server template...
11:00 < G> wakko666: yeah :)
11:00 < G> wakko666: except I started moving some of the specific checks like apache into other templates and started linking them
11:00 < dgilmore> G: not really
11:01 < wakko666> G: that works. one suggestion: copy the default graphs for Zabbix Server into the Linux Server template so you get some default graphing for each host
11:01 < mmcgrath> G: any luck getting ahold of fchuili?
11:02 < G> argh, I meant to ping him back before
11:02 < G> dgilmore: no problem :)
11:02 < G> wakko666: ricky: you have accounts there now, irc nick/test
11:02 < mmcgrath> I'll drop him an email
11:03 < G> wakko666: they were in default settings iirc
11:03 < G> oh maybe not
11:03 < mmcgrath> G: you don't happen to know if we can plug this in to FAS do you? 11:03 * dgilmore will note he tried zabbix aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaannd founnnnnnnnnnnnnnnnnnnnd ituseleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeess hard to configure and didnt work right
11:04 < G> wakko666: okay, done that now
11:04 < G> mmcgrath: I don't think so, kinda like cacti in a way
11:04 < mmcgrath> dgilmore: thats so funny, G set this up in a matter of hours and have no problems at all ;-)
11:05 < wakko666> i'm not sure about plugging the auth into FAS, but it's PHP so at the very least it should be hackable
11:05 < dgilmore> mmcgrath: monitoring localhost worked
11:05 < dgilmore> mmcgrath: but that was it
11:05 < G> http://publictest3.fedoraproject.org/zabbix/charts.php?period=86400&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10017&graphid=5
11:05 < ricky> Worst case, we put it behind basic auth.
11:05 < mmcgrath> dgilmore: time for another look :)
11:05 < G> I like the stuff like that
11:05 < mmcgrath> ricky: yeah, thats what I was thinking
11:05 < dgilmore> mmcgrath: it was about 3 or 4 months ago i think
11:05 < G> dgilmore: I got 4 hosts monitored in no time, only trouble was iptables on the pt machines :)
11:06 < wakko666> one note: stacked graphs occasionally don't render quite right. sometimes zabbix leaves white space between data sets
11:06 < G> wakko666: yeah, but it still shows the trend quite nicely
11:06 < wakko666> G: agreed.
11:07 < wakko666> for the web servers, setting up app-specific web checks is great, and fairly easy to do
11:07 < G> hmmm whats on PT7, it takes a bit of beating
11:08 < G> http://publictest3.fedoraproject.org/zabbix/charts.php?period=43200&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10026&graphid=31
11:08 < G> okay, so I think the first thing is:
11:08 < G> What are our requirements?
11:08 < ricky> pt7 looks fine to me
11:08 < G> I can easily add the following:
11:08 < ricky> Ah, it's back in the green now.
11:09 < G> -> Equal checking abilities to nagios (i.e. the type of checks)
11:09 < ricky> Could you walk us through the processes of adding a complex check?
11:09 < G> -> Ability to send out e-mails/pagers
11:09 < ricky> And also, is there any sort of equivalent screen to https://admin.fedoraproject.org/nagios/cgi-bin//status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 in nagios?
11:09 < mmcgrath> and what is the difference between a "web" check and just your normal check?
11:09 < mmcgrath> how do we write custom plugins?
11:09 < mmcgrath> why is the sky blue?
11:09 < G> -> Ability to customise stuff
11:10 < ricky> (As little information as possible - a view with *just* what problems are going on)
11:10 < G> ricky: yes
11:10 < wakko666> mmcgrath: by 'web check' i mean, a semi-intelligent check of a web-app, where you can set up a series of steps for it to check through such as "hit koji.fp.org, click packages, click builds, etc"
11:10 < G> http://publictest3.fedoraproject.org/zabbix/tr_status.php?onlytrue=true&noactions=false&compact=false&select=false&txt_select=&sort=priority
11:10 < dgilmore> can i just edit a nice easy to read config file to do things?
11:11 < G> dgilmore: and break it while you try to work out why it broke
11:11 < wakko666> dgilmore:  all config is done through the zabbix web gui
11:11 < G> errr
11:11 < G> and break it and spend ages working out why you broke it
11:11 < wakko666> ricky: the zabbix equivalent to that nagios screen is the Monitoring -> Overview screen, though under Screens, you can set up a customized view as well.
11:12 < dgilmore> wakko666: to me thats really bad
11:12 < G> wakko666: the triggers = true page is like that too
11:12 < G> (the link I pasted just before)
11:12 * dgilmore personally doesnt like configuring though a web gui. maybe why zabbix did not work out for me
11:12 < wakko666> dgilmore: it's a different paradigm. i don't equate different to bad. not having config files doesn't strike me as a flaw.
11:13 < abadger1999> wakko666: makes it harder to manage it via puppet.
11:14 < wakko666> abadger1999: yes and no. there are config files for the polling server daemon and the client-side agent. at $dayjob, i push the agent configs via puppet
11:14 < abadger1999> <nod.
11:15 < ricky> I guess abadger1999 was referring to things like configuration for specific checks and things like that
11:15 < G> wakko666: custom checks are defined in the agent config right?
11:15 < wakko666> to me, the big thing with zabbix is that it's essential to back up the db, and export your configs on a regular basis. it's painful to spend hours setting zabbix up, and have your db get corrupted and have to do all that work all over again
11:16 < ricky> So the exciting question: What problems that we're seeing with nagios does zabbix solve?
11:16 < wakko666> G: custom checks can be one of two things. custom zabbix-agent checks, and zabbix server-side remote checks
11:16 < G> wakko666: oh thats extra nifyt
11:16 < ricky> One thing is combining cacti functionality - what else?
11:16 < G> *nifty
11:16 < G> ricky: distributed monitoring :)
11:17 < ricky> Can you elaborate a bit? :-)
11:17 < G> and has Brett pointed out before, complex checks
11:17 < wakko666> for me, zabbix does templates and rapid configuration of new hosts significantly better than nagios
11:17 < G> errr complex web checks
11:17 < G> yeah, the templating looks _REALLY_ good
11:18 < wakko666> zabbix also is more granular than both cacti and nagios. the default network traffic checks are done every 5 seconds
11:18 < mmcgrath> G: I take it it has similar workflow that nagios has? (not that we used it?)
11:18 < G> build a profile of the typical application server apply the template to all the app servers and your home free
11:18 < ricky> Do you have a link where I can see the templating coolness in action?
11:18 < mmcgrath> but outage happens, someone ack's it and starts working?
11:18 < G> mmcgrath: ack etc? yeah
11:18 < f13> darn, I have to leave, but I'm really interested in what platform wins out. Particularly interested in zenoss vs zabbix
11:18 < ricky> Because right now, I'm visualizing hostgroups in nagios
11:18 < wakko666> mmcgrath: yes. same basic workflow
11:19 < wakko666> f13: i vote zabbix over zenoss simply because zabbix doesn't use rpath
11:19 < ricky> f13: zenoss = zope :-(
11:19 < mmcgrath> wakko666: G: how hard is it to script outages?
11:19 < G> that'd be something brett would have to answer
11:20 < f13> wakko666: there is that.
11:20 < f13> ricky: good point.
11:20 < f13> zenoss had something going for it in that previous cacti/nagios stuff would work with it, or so was the claim
11:20 < wakko666> outages are the one thing about zabbix that is a bit unclear to me. i think the best analogue is to disable monitoring (a single drop-down box), or to acknowledge the alert
11:21  * ricky still hasn't figured out where he can see templates
11:21 < wakko666> being that zabbix doesn't do repeated alerts, you'll only get a single "down" page anyway...
11:21 < mmcgrath> wakko666: as in its difficult to schedule an outage ahead of time?
11:21 < G> ricky: http://publictest3.fedoraproject.org/zabbix/hosts.php?groupid=0&config=2
11:21 < wakko666> ricky: Configuration > Items or Triggers. there's a Template drop-down
11:21 < ricky> Aha
11:22 < wakko666> mmcgrath: yeah, basically. as far as i've seen, zabbix doesn't yet have the concept of scheduled outages. a service is either up or down, and not much beyond that
11:23 < wakko666> i suspect that may be on their todo list for the next version, though
11:23 < ricky> So where can I see the linkage between a template and the checks for that template?
11:23 < G> I don't think it's an exact issue
11:23 < G> ricky: Items
11:23 < jcollie> you could always shut down the zabbix server :)
11:24 < wakko666> ricky: the expression column will have the template name in it
11:24 < mmcgrath> I've only looked a little bit but... how well does service deps work?
11:24 < wakko666> ricky:  err... not expression column... the name column.
11:24 < ricky> I think I got it
11:24 < wakko666> mmcgrath: dependencies are dead easy.
11:24 < G> mmcgrath: it'd appear you can add multiple dependences per trigger
11:25 < G> http://publictest3.fedoraproject.org/zabbix/triggers.php?form=update&triggerid=10043&hostid=10001
11:25 < wakko666> if you check apache on host A, but that check goes through router B, you add a dependency on the apache check so that the check doesn't execute unless the checks for router B are passing.
11:28 < mmcgrath> So really
11:29 < mmcgrath> G: how about this... We give it a quick talk tomorrow at the meeting there. If there's no blockers or major opposition. We get it on noc1 and get to work?
11:29 < G> mmcgrath: so your happy with what I've done on pt3 so far?
11:30 < mmcgrath> Yeah so far. I'd like to see it monitoring a couple of things along side nagios, both sending notifications, and see how it does in production.
11:30 < mmcgrath> so not spending a ton of time on it, but monitoring a few critical bits that frequently have problems.
11:30 < G> in that case sure, except if we are putting into production, I guess we should grab Jeff's 0.4.6 update and put it in f-i until it appears in epel
11:31 < G> I'll be happy to lead that task
11:31 < G> wakko666: jcollie: you both in sysadmin-noc?
11:31 < mmcgrath> G: excellent.
11:31 < wakko666> G: applying now. :)
11:31 < G> I'll sponsor you :)
11:32 < wakko666>  yay!  :-)
11:32 < G> mmcgrath: I think we'll leave the internal authentication for now, I'll leave the main part readable by everyone, and add accounts for everyone in sysadmin-main/noc thats active
11:33 < G> wakko666: done
11:33 < mmcgrath> G: thats fine.
11:34 < G> okay, so adjourned until the inframeeting 2000UTC tomorrow :)
11:34 < ricky> How can I trigger a check?
11:34 < wakko666> ricky:  turn off the service that it's checking.   ;-)
11:34 < ricky> Oh.
11:35 < wakko666> you can also just flip the logic of the trigger.
11:35 -!- G changed the topic of #fedora-meeting to: Channel is used by various Fedora groups and committees for their regular meetings | Note that meetings often get logged | For questions about using Fedora please ask in #fedora | See http://fedoraproject.org/wiki/Communicate/FedoraMeetingChannel for meeting schedule
11:36 < G> I'll post a log to the infra-list soon so people can have a read before the main meeting
11:37 < mmcgrath> G: good ide
11:37 < mmcgrath> a
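
For anyone wanting to put rough numbers on the database sizing question from the log (growth depends on poll rates; polled data kept for about 7 days, trend data kept after that), here is a minimal back-of-the-envelope sketch. The per-row byte sizes, the item counts and the one-trend-row-per-item-per-hour figure are assumptions made up for illustration, not measurements from any Zabbix install, and estimate_db_bytes is a hypothetical helper, not part of Zabbix.

```python
# Rough Zabbix database size estimate.
# All constants below are illustrative assumptions, not measured values.

def estimate_db_bytes(items, poll_interval_s, history_days, trend_days,
                      bytes_per_history_row=50, bytes_per_trend_row=128):
    """Return (history_bytes, trend_bytes) once retention limits are reached."""
    polls_per_day = 86400 / poll_interval_s
    # Polled ("history") data: one row per item per poll, kept history_days.
    history_rows = items * polls_per_day * history_days
    # Trend data: assumed to be one aggregated row per item per hour.
    trend_rows = items * 24 * trend_days
    return (int(history_rows * bytes_per_history_row),
            int(trend_rows * bytes_per_trend_row))


if __name__ == "__main__":
    # Hypothetical trial: 10 hosts x 100 items each, polled every 60 seconds,
    # 7 days of polled data and one year of trend data.
    history, trends = estimate_db_bytes(items=1000, poll_interval_s=60,
                                        history_days=7, trend_days=365)
    print(f"history: {history / 1e9:.2f} GB, trends: {trends / 1e9:.2f} GB")
```

The point of the sketch is only that the polled-data part scales with the poll rate (halving the interval doubles it), while the trend part scales with how long trends are kept, so the two knobs can be tuned separately during a trial.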

