[Linux-cluster] Clustering and suspend-to-disk?

Nigel Cunningham nigel at nigel.suspend2.net
Wed Dec 20 21:36:25 UTC 2006


Hi.

On Wed, 2006-12-20 at 10:15 -0500, Lon Hohberger wrote:
> On Wed, 2006-12-20 at 12:13 +1100, Nigel Cunningham wrote:
> > Hi all.
> > 
> > A long while ago now, I spoke with someone (who I'll keep anonymous)
> > about the possibility of suspending a cluster to disk. The person seemed
> > to be reasonably excited about the idea, since it would potentially be
> > quite useful in a power outage situation with limited UPS capability
> > (particularly where the state of computations couldn't easily be
> > serialised and restarted later).
> > 
> > I'm now in a situation where I don't have a lot of time to work on it,
> > but am interested in starting to make modifications to Suspend2 to add
> > such support. Before I do it, though, I wanted to ask whether you guys
> > as a whole would be interested in such support, or whether you think I'd
> > be wasting my time.
> 
> These are not questions looking for answers - they're things to think
> about (and there will be more):
> 
> * What happens if the suspend fails for one or more nodes?  Is the
> cluster state lost as a whole?

I was thinking about that too. Am I right in imagining that there will
already be mechanisms in place to handle nodes disappearing and
reappearing without warning? If so, the suspend could continue for the
rest of the nodes, and any nodes that failed to suspend would appear at
resume time as if they'd spontaneously rebooted.

The possible causes of a failure to suspend are:
- Failure to freeze all processes. This is generally not a problem
anymore on x86; other platforms should be similar, and any exceptions
should be addressable.
- Failure to obtain sufficient memory and/or storage to write the image.
Sufficient memory is less of a problem with Suspend2 than with the
alternatives, because we free far less memory. As for storage, ordinary
files can be used rather than swap, so (assuming the space was
sufficient to start with) this isn't an issue. A rough pre-flight check
is sketched after this list.
- Failure of devices to suspend/resume around the atomic copy or while
writing the image. I'm also imagining that drivers would be
(increasingly) well tested, so failures would become more and more rare.
When they do occur, mechanisms such as PM_TRACE could be used to
diagnose the cause.
- Hardware failure: nothing I could do about that.
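
For illustration only, a minimal pre-flight check along those lines
might look like the Python sketch below. The image-size estimate (all
non-free RAM) and the extra file-backed space parameter are my
assumptions, not Suspend2's actual accounting.

# Rough pre-flight check before asking a node to suspend. The image-size
# estimate (all non-free RAM) is a crude assumption, not Suspend2's
# actual accounting.

def meminfo_kb():
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info

def enough_storage_for_image(extra_file_kb=0):
    """Return True if swap, plus any ordinary-file space set aside for
    the image, looks large enough to hold it."""
    m = meminfo_kb()
    estimated_image_kb = m["MemTotal"] - m["MemFree"]
    available_kb = m.get("SwapFree", 0) + extra_file_kb
    return available_kb >= estimated_image_kb

if __name__ == "__main__":
    print("OK to suspend" if enough_storage_for_image() else "Not enough space")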

> * What if the resume fails for one or more nodes?  How do you handle
> getting the cluster back online automatically?

I'm thinking of the same logic as above. At resume time, the only things
that can cause a failure to resume are drivers and hardware (i.e.
nothing in the process itself - while suspending, we make sure that
we'll be able to resume).

> * No matter how well STD & resume work, there will be changes while the
> cluster is offline which you will need to be able to handle during /
> after the resume phase (TCP connections & DHCP leases time out for
> example).

Right. They can be handled by scripts run before the cycle starts and
after it completes; a rough sketch of such a hook follows.
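
Something like the following per-node Python sketch, for instance. The
dhclient invocations and the eth0 interface name are assumptions; the
details would differ per distribution and per service.

# Sketch of pre-suspend / post-resume hooks for one node. The dhclient
# invocations and the eth0 interface name are assumptions; a real setup
# would use whatever the distribution provides.
import subprocess
import sys

IFACE = "eth0"  # hypothetical interface name

def pre_suspend():
    # Release the DHCP lease so the server doesn't keep a stale binding
    # while the cluster is powered off.
    subprocess.call(["dhclient", "-r", IFACE])

def post_resume():
    # Re-acquire a lease after resume.
    subprocess.call(["dhclient", IFACE])

if __name__ == "__main__":
    hooks = {"pre": pre_suspend, "post": post_resume}
    hooks[sys.argv[1]]()

Long-lived TCP connections can't really be preserved across a lengthy
outage anyway, so the post-resume hook would mostly be about restarting
or nudging the affected services.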

> > After that, I'd like to work toward
> > supporting suspending to shared storage.
> 
> On suspending to shared storage:
> 
> * Do you intend to be able to use this to replace machines?

If they have the same hardware, you could suspend, stick the storage in
the new machine and resume today. For machines with different hardware,
we'd be reliant on developments in the hotplugging code and on more
reliable failure handling in drivers (i.e. not oopsing when the hardware
isn't there post-resume, for example).

> * How can one prevent a machine from resuming from the wrong memory
> image (or two machines resuming from the same image)?

I haven't thought about it in detail yet, but I'd imagine we'd use some
sort of identifier (an IP or MAC address, say) to find the right image -
something along the lines of the sketch below.
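
As a very rough sketch (the images.map file and its format are entirely
hypothetical - nothing like it exists in Suspend2 today), a node could
pick its image from shared storage by MAC address:

# Sketch of picking the right image by MAC address. The images.map file
# and its "mac image-path" format are hypothetical; Suspend2 defines no
# such scheme today.

def node_mac(iface="eth0"):
    """Read this node's MAC address from sysfs."""
    with open("/sys/class/net/%s/address" % iface) as f:
        return f.read().strip().lower()

def find_image(map_path="/etc/suspend2/images.map"):
    """Return the image path recorded for this node, or None."""
    mac = node_mac()
    with open(map_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0].lower() == mac:
                return fields[1]
    return None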

Thanks for your response.

Regards,

Nigel



