Problems with PowerMac 2.7GHz cooling with 2.6.15-1.1824_FC4

Brian D. Carlstrom bdc at carlstrom.com
Fri Jan 20 02:46:18 UTC 2006


I was running 6 G5 machines with various FC4 kernels up to and including
2.6.14-1.1656_FC4. However, our two newest 2.7GHz dual processor
machines were being powered off by the therm_pm72 driver because of
overheating. The problem was confirmed by use of the
/sbin/critical_overtemp callback that the therm_pm72 driver provides.
Since we are using these machines as compute boxes, we have been limping
along with a critical_overtemp script that logged the invocation and
rebooted (instead of powering off).
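
A handler of this kind could look roughly like the following Python
sketch. It is an illustration only, not the script described above. It
assumes the driver simply executes /sbin/critical_overtemp without
meaningful arguments; the log path, the reboot command, and the name
handle_overtemp are hypothetical choices.

```python
import subprocess
import time

# Hypothetical defaults for illustration; adjust for the local system.
LOG_PATH = "/var/log/critical_overtemp.log"
REBOOT_CMD = ["/sbin/shutdown", "-r", "now"]


def handle_overtemp(log_path=LOG_PATH, reboot_cmd=REBOOT_CMD,
                    run=subprocess.call):
    """Record that the overtemp callback fired, then request a reboot."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    # Append so that earlier events survive across reboots.
    with open(log_path, "a") as log:
        log.write("%s critical_overtemp invoked, rebooting\n" % stamp)
        log.flush()
    # `run` is injectable so the function can be exercised without
    # actually rebooting the machine.
    return run(reboot_cmd)
```

An installed script would end with a call to handle_overtemp(). Passing
a stub as `run` allows a dry run that only writes the log entry.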

Recently I saw that there was a patch to 2.6.15 to fix a bug in therm_pm72
that was contributing to the overtemp situation, specifically this patch:

commit 6ee7fb7e363aa8828b3920422416707c79f39007
Author: Benjamin Herrenschmidt <benh at kernel.crashing.org>
Date:   Mon Dec 19 11:24:53 2005 +1100

    [PATCH] powerpc: g5 thermal overtemp bug
    
    The g5 thermal control for liquid cooled machines has a small bug, when
    the temperatures gets too high, it boosts all fans to the max, but
    incorrectly sets the liquids pump to the min instead of the max speed,
    thus causing the overtemp condition not to clear and the machine to shut
    down after a while. This fixes it to set the pumps to max speed instead.
    This problem might explain some of the reports of random shutdowns that
    some g5 users have been reporting in the past.
    
    Many thanks to Marcus Rothe for spending a lot of time trying various
    patches & sending log logs before I found out that typo. Note that
    overtemp handling is still not perfect and the machine might still
    shutdown, that patch should reduce if not eliminate such occcurences in
    "normal" conditions with high load. I'll implement a better handling
    with proper slowing down of the CPUs later.
    
    Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
    Signed-off-by: Linus Torvalds <torvalds at osdl.org>

I saw that the 2.6.15-1.1823_FC4 kernel had this patch so I tried it,
but now I have a different problem. The cooling system runs full blast
because the hardware is not receiving the usual once-a-second commands
from the OS. This is very similar to the situation with the old FC4
kernels such as 2.6.11-1.1369_FC4, where the therm_pm72 driver was not
enabled because it only checked for PowerMac7,2 machines (as suggested
by pm72 in the driver name) and the new machines were detected as
PowerMac7,3. The machines no longer reboot, but the room that they are
in (a shared office for 3) is uninhabitable because of the noise.
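
To illustrate the model mismatch in the older kernels, here is a small
user-space sketch in Python. The real check happens inside the kernel
driver and is not reproduced here; this only shows why an exact match
against PowerMac7,2 skips a machine that reports PowerMac7,3. It assumes
the model string is exposed at /proc/device-tree/model, and the set
names and function names are hypothetical.

```python
# Illustrative sets only: the first mirrors a check for PowerMac7,2
# alone, the second a check that also accepts PowerMac7,3.
OLD_SUPPORTED = {"PowerMac7,2"}
NEW_SUPPORTED = {"PowerMac7,2", "PowerMac7,3"}


def read_model(path="/proc/device-tree/model"):
    """Return the firmware model string without its trailing NUL byte."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.rstrip(b"\x00").decode("ascii", "replace").strip()


def driver_would_attach(model, supported):
    """Exact-match test, mimicking a driver that compares model names."""
    return model in supported
```

With these definitions, driver_would_attach("PowerMac7,3", OLD_SUPPORTED)
is False, which corresponds to the fans running flat out because no
driver is managing them.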

Today I switched to 2.6.15-1.1824_FC4 to confirm that the situation
remains unchanged. The only superficial difference I notice between
the kernels is that therm_pm72 was built as a kernel module in all
FC4 kernels until now, but it is now built directly into the kernel.
I'm not sure if this could be a problem but I wanted to mention it.
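
One quick way to see which kind of build a running kernel has is to
look for the driver in /proc/modules, which lists loaded modules only;
built-in drivers never appear there. The Python sketch below does that
lookup. The function names are hypothetical, and a missing entry does
not by itself distinguish "built in" from "module not loaded".

```python
def listed_in_proc_modules(name, proc_modules_text):
    """True if `name` is the first field of a line in /proc/modules."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The module name is always the first whitespace-separated field.
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Return the current contents of /proc/modules as text."""
    with open(path) as f:
        return f.read()
```

On a live system, listed_in_proc_modules("therm_pm72", read_proc_modules())
would be expected to return True on the older module-based kernels when
the driver is loaded, and False when it is compiled in.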

In any case, for the short term my officemates and I have fled to other
offices. I'll probably take a whack at debugging the problem, but wanted
to post to let people know of the problem with the new kernel, and the
old kernel for that matter!

-bri

P.S. - I'll also note that benh, the driver author, said elsewhere that
the overtemp problem is partially a manufacturing problem with the later
machines. We seem to be seeing that, as our original 2.7GHz machine that
we got when they first came out does not have any cooling problems even
though it's our server and has a far higher load than the other machines.

-bri



