Fedora Makes a Terrible Server?

Roger Heflin rogerheflin at gmail.com
Wed Mar 26 00:49:40 UTC 2008


Les Mikesell wrote:
> Roger Heflin wrote:
>>
>>> I can't recall ever being in a position of "having to bring in new 
>>> hardware".  What scenario forces this issue on you?  I haven't 
>>> noticed a shortage of vendors who will sell RHEL supported boxes.  
>>> But it sounds like you have an interesting job...
>>>
>>
>> More cpu power needed to do the job.   And the new boxes aren't 
>> officially RHEL supported (and sometimes won't even boot with the 
>> latest update-but will work with the latest fedora/kernel.org). 
> 
> Something faster than IBM could sell you?

At the time, yes. This was before IBM sold AMD hardware, and the early, though 
troublesome, Athlons were faster than the Intel parts.
> 
>> I had a subset of machines (about 250 machines) all of which had 
>> reached about 500+ days of uptime (the uptime counter rolled over)
> 
> Wasn't that fixed circa RH8?  I had some 7.3 machines roll over twice.

It was pre-RH8.
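(A side note, my own back-of-the-envelope sketch: on those 2.4-era kernels 
uptime was derived from a 32-bit jiffies counter ticking at HZ=100, which is 
why the counter wrapped right around the ~500-day mark.)

```python
# Why the uptime counter on old 2.4-era kernels wrapped near 500 days.
# Assumes HZ=100, the stock tick rate on those x86 kernels.
HZ = 100                 # timer interrupts per second
wrap_seconds = 2**32 / HZ    # 32-bit jiffies counter overflows here
wrap_days = wrap_seconds / 86400  # 86400 seconds per day
print(f"uptime counter wraps after about {wrap_days:.0f} days")  # ~497 days
```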

> 
>> The issue with all OSes is that no one tests enough to catch these 
>> high MTBF issues, and in a big environment a machine crashing 1x per 
>> every 1000 days of uptime, comes to 1 machine a day crashing because 
>> of software, and typically the enterprise OSes aren't even close to 
>> that level, and while fedora is worse, it is just not that much worse.
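The fleet-level arithmetic there is worth making explicit (my illustration, 
with round numbers):

```python
# A per-machine software crash rate that looks negligible on one box
# adds up across a big cluster: one crash per 1000 days of uptime,
# across 1000 machines, means a machine crashing every day on average.
machines = 1000          # size of a big environment
mtbf_days = 1000         # one software crash per 1000 days per machine
expected_crashes_per_day = machines / mtbf_days
print(expected_crashes_per_day)  # 1.0 -- one machine down per day
```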
> 
> I don't think RH7.3 with its final updates or Centos3.x (where x>1) had 
> anything approaching a software crash per 1000 days - at least not in 
> the base system and common services.  I mostly skipped the 4.x series 
> because I didn't trust the early 2.6 kernels at all, but 5.1 seems solid.
> 

Both of them have issues if you are running NFS servers with lots of clients; 
other than that they are pretty stable. But if you rely heavily on NFS, that is 
a show-stopper. And once you get a working, stable setup, if you really want 
stability you don't touch it: no matter how well anyone tests things, they will 
miss something, and it gets worse the more different applications you run, each 
doing different odd things, any of which may find one of the bugs no one at 
Red Hat/SUSE found in their testing.

And on top of that, I have had trivial driver changes in the enterprise OSes 
cause huge performance regressions. One FC driver update changed the queue 
depth to 64, which dropped throughput to 30% of what it was before on certain 
external FC RAID disk arrays. This affected SLES9sp3 (the 9sp[12] kernel was 
OK), SLES10, any kernel.org kernel with the newer driver, and RHEL4 (all of 
them at the time), so no update can be counted on not to cause issues. The 
error was not seen by the driver maintainer until they got one of the external 
arrays to test with and compared it against a competitor's board that was 3x 
faster under the newer kernel but almost identical under the older kernel; both 
RHEL and SLES testing missed it. To fix it we actually had to update to an 
unreleased driver that allowed the queue depth to be changed back down (none of 
the updates at the time fixed it), and wait for an update from SUSE. To get 
this fixed it was far easier to work with the upstream driver maintainer and 
have them push the update to the enterprise vendors than to get the enterprise 
vendors to find and fix the problem. I was told by a different upstream 
maintainer that the enterprise vendors typically pushed any serious issue 
straight upstream and did very little with it themselves, if you could get it 
past their first-line support people.
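For what it's worth, on current kernels that particular knob is exposed in 
sysfs, so a too-aggressive driver default can be tuned back at runtime without 
waiting for a new driver. A hedged sketch (not the fix we used back then; 
"sda" is a placeholder device name):

```python
# Lower the per-device SCSI queue depth via the standard sysfs
# attribute /sys/block/<dev>/device/queue_depth. Needs root on a
# real machine; sysfs_root is parameterized so this can be exercised
# against a fake tree.
from pathlib import Path

def set_queue_depth(depth, sysfs_root="/sys", dev="sda"):
    """Write the device's queue depth and return the old value."""
    node = Path(sysfs_root) / "block" / dev / "device" / "queue_depth"
    old = int(node.read_text())        # e.g. the new default of 64
    node.write_text(f"{depth}\n")      # back it down, e.g. to 32
    return old
```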

The big problem is that the testing has to verify that it does not crash, that 
it runs at roughly the same speed as before, and that it still gives the same 
answers; and even if you run every test you know about, some configuration will 
still slip through for a given setup.

I guess my experience is that even with enterprise updates, at least 25-50% of 
the time there is a serious regression (speed, crash, wrong answer), so one has 
to carefully consider what is gained by doing a given update. The testing 
required for a full update is really no better than the testing required to go 
from F7 to F8, and Fedora puts out new kernels faster, so getting a fix into 
the stream is a lot quicker than with the enterprise OSes. Once you get a 
kernel that works correctly on a given piece of hardware, you stop updating; 
some of the things I have run into on an update are things you would never have 
thought to test for, so you have to watch out on any update, and it is best to 
update only when required.

Some of the customers I used to support typically stayed on what shipped with 
the machine, because their validation procedures were fairly extensive and an 
update was not worth it when no useful features were to be gained.






More information about the fedora-list mailing list