[olpc-software] graceful handling of out-of-memory conditions
Havoc Pennington
hp at redhat.com
Sun Mar 26 19:03:00 UTC 2006
Alan Cox wrote:
> I still see it differently. If your code does not check a malloc return then
> it is broken. Since low memory accesses on some systems might allow patching
> of code it must also be considered unfit to ship for security reasons. Most
> modern code gets this right, in part because the old BSD approach of 'its hard
> so lets not bother' has been replaced by rigour in all the camps, notably
> OpenBSD. In addition tools both free (sparse) and non-free (eg Coverity) can
> systematically identify missing NULL checks.
I don't really believe most code that attempts to handle OOM _works_ -
the reason is that I wrote dbus to handle OOM, and then later added
comprehensive test coverage by having the test suite run every code path
over and over, failing the first malloc on first run, second malloc on
second run, etc. Thus testing handling of NULL return for every single
malloc.
After adding the test suite, I think I probably had a bug in how at
least 5-10% of null malloc returns were handled. In many cases the bug
was quite complex to fix, because what you have to do is make everything
"transactional" which (depending on the code) can be arbitrarily
complicated. You also have to add return values and failure codes in
lots of places that might not have them before, which can modify a
public API pretty heavily. Once you add the complex "transactional"
code, it then never gets tested (unless you have a test suite like the
one I did for dbus).
Making something sane happen on OOM is a lot more work than just adding
"if (!ptr)" checks.
If we assume that most apps are half as complicated as dbus, and most
programmers are twice as smart as I am, you're still talking about 2-3%
of theoretically-handled malloc failures not being handled properly. And
in a real OOM situation you'd probably see multiple malloc failures in a
row, which adds up to a pretty good chance of something breaking. It's
just not gonna be reliable.

Another thing to keep in mind is that I think handling OOM probably adds
10-20% of code size overhead to dbus. It's a lot of extra code... which
you pay for when writing it, maintaining it, and running it.
You also have to think about what an app does on OOM ... for dbus it
returns an error code for the current operation, then goes back and sits
in the main loop, keeps returning error codes for any operations that
don't have enough memory... if it can't even get enough memory to return
an error, then I believe it just sleeps for a bit and tries again. For
most gui apps "go back to the main loop and sleep a little while" is
about the best they'll be able to do. Only rarely (e.g. for a large
malloc when opening an image file) does it make sense to display a
malloc failure as an error dialog.
Given all this, just having malloc() block and always succeed is
tempting, with the main problem being large mallocs like the
opening-an-image-file example... glib has g_try_malloc() to distinguish
that case, since the normal glib behavior is to exit on OOM.
Another complexity that applies to a normal Linux system but perhaps not
to OLPC is that with e.g. the default Fedora swap configuration, the
system is unusably slow and thoroughly locked up long before malloc
fails. It's awfully tempting to push the power switch when the "you are
out of memory" dialog starts taking 30 minutes to come up, instead of
waiting patiently to press the button on said dialog.
Havoc