[Ovirt-devel] [R&D] Breaking the Browser

Mon Jul 7 05:46:43 UTC 2008

On Sun, Jul 6, 2008 at 7:26 PM, mark wagner <mwagner at redhat.com> wrote:
> Note sort of a combinational reply to several responses in this thread....
Thanks again for keeping this civil. The only reason I asked to end this thread
was to prevent it from turning into a flamefest.

> Jeff Schroeder wrote:
>>
>> On Fri, Jul 4, 2008 at 7:33 PM, mark wagner <mwagner at redhat.com> wrote:
>>>
>>> Jeff Schroeder wrote:
>>>>
>>>> On Thu, Jul 3, 2008 at 2:14 PM, Jason Guiditta <jguiditt at redhat.com>
>>>> wrote:
>
>
>>> So if your job is to monitor the 25 hosts in your pool and take immediate
>>> and
>>> effective action to mitigate any performance or catastrophic issues in
>>> said
>>> pool within one minute and 26 secs (3 nines, assuming this is the only
>>> issue
>>> that day) of the event, how are you going to monitor your pool ?
>>
>> That is what monitoring software is for. If you expect someone to _look_
>> at
>> a webgui 24/7 you are approaching the problem from the absolute wrong
>> angle.
>>
>>> If we don't provide the ability to notify immediately when a host goes
>>> down,
>>> you are rapidly eating into time specified to resolve a problem.
>> You are agreeing with me on this one. This is what monitoring software is
>> for,
>> not a website. Something like nagios could call a pager or run a script to
>> do
>> something much faster than a human could.
>>
>
> So I guess the question to people is why not just make the decision to
> switch
> over to Nagios and something like Cacti now ?
> We all seem to agree that we need the Nagios type of functionality, Cacti or
> something should be able to solve our graphing requirements as well.
If you've ever looked over the cacti codebase you might want to go let out
aggression on an inanimate object afterwards. Apache + php would also add a lot
of dependencies to the management node appliance. +1 for nagios, but you might
be better off rolling your own small graphing libs in ruby or whatnot.

>>> We need to look at making the nav bar have near realtime capabilities.
>>> I need to know if a system is getting close to capacity (change the color
>>> of the icon?) or is offline.
Key word being near-realtime, like I stated in an earlier thread. Even if it
really only polls, gmail does a pretty good job at the kind of updating oVirt
needs.

>> Changing the color of the icon would be good, but the admin should be able
>> to set thresholds. When X reaches Y%, send alert to foo at bar.com or run
>> script /usr/local/bin/foo-alert.sh
>>
>
>>> How do the admins get notified as quickly as possible?
>>> Do they sit in front of a terminal and hit refresh on their browser?
>>> Set their mail clients to fetch every 10 secs ?
>>> This is clearly a case where time is money, if you are in a big NOC and
>>> miss your SLA it could cost big bucks, not to mention a job or two.

This is clearly a case where you've not sat in an admin's shoes. If the only
tool you give a good admin to monitor his hosts is a web-ui, he is going to
crack open perl LWP or python and scrape it every few seconds. He's also going
to say nasty things to you for only giving him a webui to work with.

To reinforce the idea again: a web ui should never be relied on for realtime
issues or alerts. Give a good api (commandline or web services) and the admin
will deal with that. Good admins are too lazy to hit refresh and will write a
script to do that for them. This is my day job and has been for some time.
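
For illustration, the kind of script an admin might write against such an api
could look roughly like the sketch below. It is only a sketch under stated
assumptions: the fetch callable, the {"host": "state"} shape and the state
names are all made up here, since no such oVirt api is described in this
thread.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical shape of one poll result: {"host01": "up", "host02": "down"}
HostStates = Dict[str, str]


def diff_states(old: HostStates, new: HostStates) -> List[Tuple[str, str, str]]:
    """Return (host, old_state, new_state) for every host whose state changed."""
    changes = []
    for host in sorted(set(old) | set(new)):
        before = old.get(host, "unknown")
        after = new.get(host, "unknown")
        if before != after:
            changes.append((host, before, after))
    return changes


def poll(fetch: Callable[[], HostStates],
         on_change: Callable[[str, str, str], None],
         interval: float = 5.0,
         rounds: int = 3,
         sleep: Callable[[float], None] = time.sleep) -> None:
    """Call fetch() `rounds` times and report changes between consecutive polls.

    fetch stands in for whatever api call returns host states (an assumption);
    on_change is where a pager, mail or script hook would go.
    """
    previous = fetch()
    for _ in range(rounds - 1):
        sleep(interval)
        current = fetch()
        for change in diff_states(previous, current):
            on_change(*change)
        previous = current
```

The point is only that the human never hits refresh: the loop does the polling
and on_change decides what to do with a state change.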

>> Mark, without causing any bad blood, this conversation is basically over.
>> You
>> agree with me that notification should happen as soon as possible. You
>> disagree
>> with me on how it should be done. For true autonomy, the human aspect
>> needs to be taken out of the picture as much as possible. This is why I'd
>> argue a script should automatically do alerting and NOT a human. What
>> the human does with that alert is up to the business.
>>
>
> Jeff, no bad blood thoughts here, this is an open, professional discussion
> aimed at making ovirt as good as it can be.  My goal is to provoke discussion
> to make sure that we have considered all angles that we can and the
> ramifications of any decision.
>
> I agree that the human interaction needs to be minimized as quickly and as
> much as possible.  However, I don't know that anyone is even considering
> how to add that at this point in time. (someone jump in if I'm wrong here)
>
> I still think there will be cases where things are not automated and some
> interaction needs to get done by hand.  Thus, even though we agree on
> functionality we need, I'm not sure that the discussion should be over yet.
> For instance, I think that we should see what it takes to update the
> NavBar as soon as possible with state changes.  If I start a guest or host
> manually, I would want to see when it is available ASAP w/o the need for
> my pager to go off or an email to arrive. The logical place to me is in the
> NavBar.

While still trying to be constructive, is the difference between 5-10 seconds
(poll) and a few milliseconds (realtime) honestly going to matter? From a
systems administrator standpoint, I'm still unconvinced what this buys you.
I've worked at Fortune 500 and Fortune 200 companies with some pretty complex
uptime SLAs. This still boggles my mind. Can you give an actual use case where
this is required?

Putting my admin hat on, starting a service is a good event and should *never*
alert. Stopping, restarting, or abnormal shutdown of hosts should alert. If you
drown out the bad things with white noise, important things will get
overlooked.
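
As a minimal sketch of that policy (the event names are invented for
illustration, not taken from oVirt or nagios):

```python
from typing import Iterable, List, Tuple

# Hypothetical event names; only the "bad" ones are allowed to alert.
ALERTING_EVENTS = {"stopped", "restarted", "abnormal_shutdown"}


def should_alert(event: str) -> bool:
    """Good events such as "started" never alert; bad ones always do."""
    return event in ALERTING_EVENTS


def filter_alerts(events: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (host, event) pairs worth waking someone up for."""
    return [(host, event) for host, event in events if should_alert(event)]
```

Everything that is filtered out can still be shown in the UI; it just
shouldn't page anyone.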

> My logic is that if I'm starting it via the WUI, there is a reasonable
> chance I'm going to do something else with it. In the case of the host,
> I may move some guests to it. As a user, if there is no immediate
> notification,
> I'll be hitting refresh to see when it's available so I can continue my work.

You should never have to refresh a proper web-2.0 application. Again, if we
can strive to be somewhat similar to gmail with near-realtime updates, I think
it will be "good enough". If you disagree and want to send in patches, more
power to you. Can you give me an actual use case of the above scenario though?
Is it something along the lines of an admin starting a host and moving vms onto
it while a user waits to play with those vms? In 8+ years of doing this kind of
work I've not seen something like this where it absolutely must be realtime.
This seems niche and overkill for 95% of the cases. Feel free to prove
otherwise.

> I will admit that I do not work as an admin. I do sometimes play one when
> running performance tests, etc. Our solution is very low-tech, we poll
> until the systems are up and then monitor the tests. If "industry standard
> practice" for a NOC is to use pagers then my argument may be irrelevant
> to our "target audience" and us lab weenies can wear out the refresh button
> :)
I'm trying to move from an admin to more of a "systems developer" role. Why
not just poll more and make the data transferred very small? Here is a quick
idea from a current sleep-deprived state, please comment and/or rip it to
shreds:

Have a _very small_ bit of json with a few variables for each host. When a new
host comes up, it can add itself to an object somewhere that a json feed
creator can access. If a user is logged in with sufficient access to see host
information, a json file is created with the hosts he/she is allowed to view.
The json can sit in memory, so polling it shouldn't be a huge deal. If another
user is logged in with access to the exact same hosts, the ruby just includes
the exact same file / object or whatever and uses the same already-cached
json.
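
A rough sketch of that idea follows, in python rather than ruby purely for
illustration. The class name, the method names and the state/load fields are
all assumptions; nothing like this exists in oVirt as far as this thread says.

```python
import json
from typing import Dict, FrozenSet, Iterable


class HostFeed:
    """In-memory json feed, cached per set of hosts a user may view."""

    def __init__(self) -> None:
        self._hosts: Dict[str, dict] = {}         # host name -> small status dict
        self._cache: Dict[FrozenSet[str], str] = {}  # visible host set -> json

    def register(self, name: str, **status) -> None:
        """Called when a host comes up or changes; drops all cached feeds."""
        self._hosts[name] = status
        self._cache.clear()

    def feed_for(self, allowed: Iterable[str]) -> str:
        """Return the json for the given hosts, building it only on a cache miss."""
        key = frozenset(allowed)
        if key not in self._cache:
            visible = {h: self._hosts[h] for h in sorted(key) if h in self._hosts}
            self._cache[key] = json.dumps(visible, sort_keys=True)
        return self._cache[key]


feed = HostFeed()
feed.register("host01", state="up", load=0.4)
feed.register("host02", state="down", load=0.0)
# Two users with access to the same hosts get the same cached string.
first = feed.feed_for(["host01", "host02"])
second = feed.feed_for(["host02", "host01"])
```

Clearing the whole cache on every change is the crude part; a smarter version
would only drop the entries that include the changed host.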

If this was the "enterprise deployment" of oVirt with everything split up onto
different hosts, this would still scale pretty well on a low mid-range server
like a dl360 or whatnot. Kind of cheesy, but what do you think? And for the
"too much polling kills the server" argument, you split it up into > 1 ovirt
server and have them do realtime communication (AMQP) between each other.
Network would be the main bottleneck in this design, but it rarely is the
bottleneck in "enterprises".

Thoughts?

>
> -mark
>

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com