Outages and Koji

Mike McGrath mmcgrath at redhat.com
Sat Mar 22 16:17:13 UTC 2008


We just had the longest buildsys outage we've had since I started (not
happy).  But the good news is we have lots of options to make sure this
particular issue doesn't happen again.

The problem is a three-fold issue: 1) NFS is running on the xen dom0
(will be fixed next week), 2) nfslock is hanging and preventing our
clients from locking files, and 3) when it gets into this state a restart
of nfslock won't fix the problem; the port stays open, so we have to
restart the host.
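
For anyone who wants to check a client for symptom 2, here is a minimal
sketch of a lock probe. It is not something we run today; the function
name, the path you pass in and the timeout are all hypothetical, and it
assumes a Unix client with Python available. It tries to take and release
an exclusive lock on a file and gives up after a timeout instead of
hanging.

```python
import fcntl
import signal


class LockTimeout(Exception):
    """Raised by the alarm handler when the lock attempt takes too long."""


def _on_alarm(signum, frame):
    raise LockTimeout()


def lock_works(path, timeout=10):
    """Return True if an exclusive lock on `path` is taken and released
    within `timeout` whole seconds, False otherwise.

    Hypothetical helper: must be called from the main thread, and on a
    mount where the lock call sleeps uninterruptibly the alarm may not
    be able to break it.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        with open(path, "a") as handle:
            fcntl.lockf(handle, fcntl.LOCK_EX)  # blocks while the lock service hangs
            fcntl.lockf(handle, fcntl.LOCK_UN)
        return True
    except (LockTimeout, OSError):
        # Timed out, or the lock request was refused outright.
        return False
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

Pointing lock_works() at a scratch file on the NFS mount from a monitoring
check would at least tell us about the hang before the builders do.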

There's not much we can do about 2) or 3) except rely on upstream to fix
the problem.  Seth suggested that when we fix 1) we make the new NFS
server a RHEL4 box.  I think this is probably the best solution.


I'm also looking at solutions for some of our other applications.  Our
environment isn't in terrible shape, but I think it could be better, and
I'm looking at some different tools that might make things easier on us.
There are some apps that don't need to be load balanced, just more HA,
and sometimes those apps are getting beaten up by apps that do need to be
properly load balanced.  I'm going to go through and see what some of our
options are, and I encourage those familiar with our environment to do
the same.

Some of our apps certainly need work (smolt especially) but I think there
are things we can do as infrastructure members to mitigate risks and
ensure we're seeing fewer outages.

	-Mike



