[Linux-cluster] 2-node fencing question (IPMI/ACPI question)
lhh at redhat.com
Mon Sep 11 15:32:18 UTC 2006
On Tue, 2006-09-05 at 07:43 -0400, danwest wrote:
> What happens if the servers you are using require ACPI=on in order to
> boot. For instance IBM X366 servers need ACPI set in order to boot.
> With ACPI=on both nodes reboot when a fence occurs(see "both nodes off
> problem" in thread below). This is not desirable, especially with
> active/active clusters.
Hopefully, the X366 either turns off immediately or can be configured to
do so upon getting the "power off" command with ACPI enabled. If it
does not, then you will need remote power control or fabric-level
fencing.
Here is some relevant background information.
If you look at the IPMI v1.5 and v2 specifications, the instruction 0
for power control is supposed to force the system to S4/S5 (soft-off) state
immediately (for use in emergency situations). If you then look at the
ipmitool source code, you will find that it uses the 0 instruction when
you do a 'chassis power off' command.
(quote, source =
http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf - page 403):
[3:0] - chassis control
0h = power down. Force system into soft off (S4/S45) state. This is
for `emergency' management power down actions. The command
does not initiate a clean shut-down of the operating system
prior to powering down the system.
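
To make the quoted field concrete, here is a minimal Python sketch of how
the Chassis Control request byte is laid out. The values come from the IPMI
Chassis Control command table (NetFn 0x00, command 0x02); the names
ChassisControl and chassis_control_request are my own, for illustration
only, and are not taken from ipmitool.

```python
from enum import IntEnum

# IPMI Chassis Control command: NetFn 0x00 (Chassis), command 0x02.
NETFN_CHASSIS = 0x00
CMD_CHASSIS_CONTROL = 0x02


class ChassisControl(IntEnum):
    """Values for bits [3:0] of the Chassis Control data byte."""
    POWER_DOWN = 0x0            # immediate off, no clean OS shutdown (per spec)
    POWER_UP = 0x1
    POWER_CYCLE = 0x2
    HARD_RESET = 0x3
    PULSE_DIAG_INTERRUPT = 0x4
    SOFT_SHUTDOWN = 0x5         # ACPI soft shutdown via overtemp emulation


def chassis_control_request(action: ChassisControl) -> bytes:
    """Return (netfn, cmd, data) as raw bytes for a Chassis Control request.

    Hypothetical helper: it only builds the bytes, it does not talk to a BMC.
    """
    # Bits [7:4] of the data byte are reserved; bits [3:0] carry the action.
    data = int(action) & 0x0F
    return bytes([NETFN_CHASSIS, CMD_CHASSIS_CONTROL, data])


if __name__ == "__main__":
    # The "emergency" power down that 'chassis power off' is described as using.
    print(chassis_control_request(ChassisControl.POWER_DOWN).hex(" "))
```

The POWER_DOWN request should correspond to what you would send by hand
with 'ipmitool raw 0x00 0x02 0x00'. Whether the machine then powers off
immediately is exactly the hardware-dependent part discussed in this mail.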
linux-cluster often needs ACPI disabled with IPMI because, in many
cases, machines which receive this "emergency power off" instruction do
not appear to behave as the IPMI specification describes. That is, some
attempt a full, clean shutdown of the operating system when ACPI is
enabled. If that shutdown never completes, fencing will never complete
and the cluster will never recover.
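
As a rough illustration of why a hung shutdown blocks recovery, here is a
sketch of an "off, then confirm off" fencing step in Python. This is not
the actual fence agent code; fence_off, power_off and power_status are
hypothetical names, and the status strings "on"/"off" are an assumption
about what the power controller reports.

```python
import time
from typing import Callable


def fence_off(power_off: Callable[[], None],
              power_status: Callable[[], str],
              timeout: float = 20.0,
              interval: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Send power-off, then poll until the node reports "off".

    Returns True once "off" is confirmed, False if the timeout expires
    first. The caller must treat False as "fencing failed".
    """
    power_off()
    deadline = time.monotonic() + timeout
    while True:
        if power_status() == "off":
            return True
        if time.monotonic() >= deadline:
            # The node is still "on", e.g. stuck in a clean ACPI shutdown.
            return False
        sleep(interval)
```

A machine that honors the immediate power down reports "off" on the first
or second poll. A machine that starts a clean shutdown and hangs keeps
reporting "on", so the step never succeeds and the cluster cannot safely
recover the failed node's resources.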
Now, not all machines behave this way. If your machine powers off
immediately with ACPI enabled, then you do not need to disable ACPI.
(Note: cheating by switching the acpid event for power button presses
to /sbin/poweroff -fn does *not* count!)
It is possible that some machines are - quite simply - twiddling the
motherboard's soft power button. In that case, it is possible that
those machines can also be configured to do an immediate-off in the BIOS
when the power button is pressed, thereby alleviating the need for
booting with ACPI disabled.
There may be other ways to work around the ACPI/IPMI problem on your
specific hardware; this is just an example. Booting with ACPI disabled
is the general "quick fix", which works immediately for the majority of
machines with IPMI - and does not require hardware-specific
configuration. Booting with ACPI disabled also works for other types of
integrated power management (iLO, RSA, DRAC, etc.) which often suffer
the same problems.
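
If you go the "boot with ACPI disabled" route, it can be handy to verify
that a node really came up that way. The following is a small sketch,
assuming the usual acpi=off kernel parameter and that /proc/cmdline is
readable; the function names are made up for this example.

```python
def acpi_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line contains the acpi=off token."""
    # Simple token match; it does not handle a later acpi=... overriding it.
    return "acpi=off" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print("ACPI disabled:", acpi_disabled(read_cmdline()))
```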
As noted by others in separate emails to this list, it would be nice if
we could use the reboot operations more often - rather than "off, on"
cycles in all cases.
Most fencing solutions cannot (as far as I know) confirm that a machine
has rebooted the way they can confirm that a machine is "off" or "on". Of
course, "reboot" does not suffer the theoretical "everyone off at once"
problem, and it should eliminate the need to boot with ACPI disabled.