
Re: [scl.org] softwarecollections.org offline - post mortem



For all those who are curious about what happened to softwarecollections.org, here is a timeline and a description of the
problem.

All times are in CEST (UTC+2).

2015-08-15 14:22 - The OOM killer starts killing processes (mostly httpd and createrepo_c). Why? I do not know yet;
it is probably a bug in our application, and we will investigate it as soon as possible.
2015-08-15 18:28 - Monitoring reports that the web server is down. But since this is not marked as a critical service
and it happened during the weekend, it had to wait until Monday.
2015-08-17 ~9:00 - I am starting to investigate what happened. Ports are open, but all communication times out. I'm
trying to find somebody who can restart the machine.
This machine is hosted by BlueHost, which donated the machine and its connectivity for free. I am talking to Support via
WebChat and describing the problem. The person on the other side demands some kind of authentication (a login and some
letters from the password). But I have no such credentials, because this server was donated outside the normal process
and we never received any. We just got a server with our ssh keys on it. I am redirected to support bluehost com.
Concurrently I'm trying to notify JSmith, who is my only point of contact at BlueHost, but since he is in a different
time zone and may be on vacation (very likely during summer), I'm placing my bet on BlueHost support.
2015-08-17 10:16 - I got an automatic reply from support bluehost com saying that this email address is discontinued and
that I should open my ticket on the BlueHost website. A few moments later I found that I need an account to log in to the
ticket system (sounds like a Catch-22).
2015-08-17 10:53 - I used my personal credit card to pay for the cheapest BlueHost service so that I can get some
credentials (it is still $86).
2015-08-17 10:57 - I filed ticket HJZ-99465-311
2015-08-17 15:21 - JSmith responded over IRC. I'm describing the situation; he notified an admin.
2015-08-17 17:05 - I'm heading home from office, no updates from BlueHost or JSmith so far
2015-08-17 20:10 - The server was restarted. For an unknown reason the httpd service was not enabled, so it did not
start automatically after the reboot.
2015-08-17 9:27 - I enabled and started the httpd service and softwarecollections.org is back online.
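
The 20:10 entry points at one cheap safeguard: checking ahead of time that the services a machine must bring up on boot
are actually enabled. Below is a minimal sketch of such a check. It assumes the host uses systemd, which the timeline
does not state, and the function names (is_enabled, units_not_enabled) are hypothetical helpers written for this
illustration, not part of any existing tooling.

```python
import subprocess


def is_enabled(unit, run=subprocess.run):
    """Return True if `systemctl is-enabled UNIT` prints exactly "enabled".

    Assumption: a systemd host. Other states that systemctl can print
    (for example "disabled" or "static") are treated as not enabled here,
    which is deliberately strict for a boot-time check.
    """
    result = run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,  # is-enabled exits non-zero for disabled units
    )
    return result.stdout.strip() == "enabled"


def units_not_enabled(units, run=subprocess.run):
    """Return the subset of `units` that would not start on boot."""
    return [unit for unit in units if not is_enabled(unit, run)]
```

Calling units_not_enabled(["httpd"]) on the server would return ["httpd"] in the state described at 20:10 and an empty
list once the service has been enabled. The `run` parameter exists only so the check can be exercised without a real
systemd.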

This incident revealed several problems on both the Red Hat and the BlueHost side, and in the upcoming days we will work
on our processes so that this does not happen again.
I'm sorry for the problems it caused you.

-- 
Miroslav Suchy, RHCA
Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys

