[libvirt] (Dropping) OOM Handling in libvirt

Mon May 13 13:28:24 UTC 2019

On 5/13/19 12:17 PM, Daniel P. Berrangé wrote:
> This is a long mail about ENOMEM (OOM) handling in libvirt. The executive
> summary is that it is not worth the maint cost because:
> 
>    - this code will almost never run on Linux hosts
>    
>    - if it does run it will likely have bad behaviour silently dropping
>      data or crashing the process
> 
>    - apps using libvirt often do so via a non-C language that aborts/exits
>      the app on OOM regardless, or use other C libraries that abort
> 
>    - we can build a system that is more reliable when OOM happens by
>      not catching OOM, instead simply letting apps exit, restart and
>      carry on where they left off
> 
> The long answer follows...

I'm up for dropping OOM handling. Since we're linking with a lot of 
libraries I don't think we can be confident that we won't abort() even 
now anyway.

> 
> 
> The background
> ==============
> 
> Since the first commit libvirt has attempted to handle out of memory (OOM)
> errors in the same way it does for any other problem. Namely, a virError will
> be raised, and the method will jump to its cleanup/error label. The error
> bubbles up the stack and in theory someone or something will catch this and do
> something sensible/useful. This has long been considered "best practice" for
> most C libraries, especially those used for system services. This mail makes
> the argument that it is in fact /not/ "best practice" to try to handle OOM.
> 
> OOM handling is very difficult to get correct because it is something that
> developers almost never encounter and thus the code paths are rarely run. We
> designed our memory allocation APIs such that we get compile time errors if
> code forgets to check the return value for failure. This is good as it
> eliminates an entire class of bugs. Our error handling goto label pattern
> tries to align OOM handling with other general error handling which is more
> commonly tested.
> 
> We have code in the allocators which lets us run unit tests simulating the
> failure of any allocation that is made during the test. Executing this is
> extraordinarily time consuming as some of our unit tests have many 1000's or
> even 10's of 1000's of allocations. The running time to simulate OOM for each
> allocation is O(N^2) which does not scale well. As a result we've probably
> only run these OOM tests a handful of times over the years.

Well, I've tried this and so far, the only problems I've found were in 
tests where we're ignoring revals of VIR_ALLOC() or its friends (e.g. 
vir*New()) or we're checking it after it's used already.
BTW: to make it work I had to disable failing from mocks (obviously).

> 
> The tests show we generally do remarkly well at OOM handling, but none the
> less we have *always* found bugs in our handling where we fail to report the
> OOM (as in, continue to run but with missing data), or we crash when cleaning
> up from OOM. Our unit tests don't cover anywhere near all of our codebase
> though. We have great coverage of XML parsing/formatting, and sporadic coverage
> of everything else.
> 
> IOW, despite our clever API designs & coding patterns designed to improve our
> OOM handling, we can not have confidence that we will actually operate correctly
> in an OOM scenario.

I beg to disagree. The fact that our cleanup paths are in 99% the same 
as error paths means that if there are no mem-leaks (and if 
transformation to VIR_AUTOFREE is done the chances are low), then we 
propably do good on OOM too.

<snip/>

> The implementation
> ==================
> 
> Assuming a decision to abort on OOM, libvirt can nwo follow QEMU's lead and
> make use of the GLib library. 

No, please no. Firstly, glib is a new C dialect that only a few of us 
speak. Secondly, glib adds some padding around its objects => libvirt's 
memory footprint will grow. Thirdly, it's yet another library to link 
with (on my system libvirt links with 53 libraries already).

Michal