[olpc-software] graceful handling of out-of-memory conditions
Havoc Pennington
hp at redhat.com
Sun Mar 26 19:03:00 UTC 2006
Alan Cox wrote:
> I still see it differently. If your code does not check a malloc return then
> it is broken. Since low memory accesses on some systems might allow patching
> of code it must also be considered unfit to ship for security reasons. Most
> modern code gets this right, in part because the old BSD approach of 'its hard
> so lets not bother' has been replaced by rigour in all the camps, notably
> OpenBSD. In addition tools both free (sparse) and non-free (eg Coverity) can
> systematically identify missing NULL checks.
I don't really believe most code that attempts to handle OOM _works_ -
the reason is that I wrote dbus to handle OOM, and then later added
comprehensive test coverage by having the test suite run every code path
over and over, failing the first malloc on first run, second malloc on
second run, etc. Thus testing handling of NULL return for every single
malloc.
After adding the test suite, I think I probably had a bug in how at
least 5-10% of null malloc returns were handled. In many cases the bug
was quite complex to fix, because what you have to do is make everything
"transactional" which (depending on the code) can be arbitrarily
complicated. You also have to add return values and failure codes in
lots of places that might not have them before, which can modify a
public API pretty heavily. Once you add the complex "transactional"
code, it then never gets tested (unless you have a test suite like the
one I did for dbus).
Making something sane happen on OOM is a lot more work than just adding
"if (!ptr)" checks.
If we assume that most apps are half as complicated as dbus, and most
programmers are twice as smart as I am, you're still talking about 2-3%
of theoretically-handled malloc failures not being handled properly. And
in a real OOM situation you'd probably see multiple malloc failures in a
row, which adds up to a pretty good chance of something breaking. It's
just not gonna be reliable.

Another thing to keep in mind is that I think handling OOM probably adds
10-20% of code size overhead to dbus. It's a lot of extra code... which
you pay for when writing it, maintaining it, and running it.
You also have to think about what an app does on OOM ... for dbus it
returns an error code for the current operation, then goes back and sits
in the main loop, keeps returning error codes for any operations that
don't have enough memory... if it can't even get enough memory to return
an error, then I believe it just sleeps for a bit and tries again. For
most gui apps "go back to the main loop and sleep a little while" is
about the best they'll be able to do. Only rarely (e.g. for a large
malloc when opening an image file) does it make sense to display a
malloc failure as an error dialog.
Given all this, just having malloc() block and always succeed is
tempting, with the main problem being large mallocs like the
opening-an-image-file example... glib has g_try_malloc() to distinguish
that case, since the normal glib behavior is to exit on OOM.
Another complexity that applies to a normal Linux system but perhaps not
to OLPC is that with e.g. the default Fedora swap configuration, the
system is unusably slow and thoroughly locked up long before malloc
fails. It's awfully tempting to push the power switch when the "you are
out of memory" dialog starts taking 30 minutes to come up, instead of
waiting patiently to press the button on said dialog.
Havoc