[libvirt] (Dropping) OOM Handling in libvirt
Michal Privoznik
mprivozn at redhat.com
Mon May 13 13:28:24 UTC 2019
On 5/13/19 12:17 PM, Daniel P. Berrangé wrote:
> This is a long mail about ENOMEM (OOM) handling in libvirt. The executive
> summary is that it is not worth the maint cost because:
>
> - this code will almost never run on Linux hosts
>
> - if it does run it will likely have bad behaviour silently dropping
> data or crashing the process
>
> - apps using libvirt often do so via a non-C language that aborts/exits
> the app on OOM regardless, or use other C libraries that abort
>
> - we can build a system that is more reliable when OOM happens by
> not catching OOM, instead simply letting apps exit, restart and
> carry on where they left off
>
> The long answer follows...
I'm up for dropping OOM handling. Since we're linking with a lot of
libraries I don't think we can be confident that we won't abort() even
now anyway.
>
>
> The background
> ==============
>
> Since the first commit libvirt has attempted to handle out of memory (OOM)
> errors in the same way it does for any other problem. Namely, a virError will
> be raised, and the method will jump to its cleanup/error label. The error
> bubbles up the stack and in theory someone or something will catch this and do
> something sensible/useful. This has long been considered "best practice" for
> most C libraries, especially those used for system services. This mail makes
> the argument that it is in fact /not/ "best practice" to try to handle OOM.
>
> OOM handling is very difficult to get correct because it is something that
> developers almost never encounter and thus the code paths are rarely run. We
> designed our memory allocation APIs such that we get compile time errors if
> code forgets to check the return value for failure. This is good as it
> eliminates an entire class of bugs. Our error handling goto label pattern
> tries to align OOM handling with other general error handling which is more
> commonly tested.
>
> We have code in the allocators which lets us run unit tests simulating the
> failure of any allocation that is made during the test. Executing this is
> extraordinarily time consuming as some of our unit tests have many 1000's or
> even 10's of 1000's of allocations. The running time to simulate OOM for each
> allocation is O(N^2) which does not scale well. As a result we've probably
> only run these OOM tests a handful of times over the years.
Well, I've tried this and so far, the only problems I've found were in
tests where we're ignoring revals of VIR_ALLOC() or its friends (e.g.
vir*New()) or we're checking it after it's used already.
BTW: to make it work I had to disable failing from mocks (obviously).
>
> The tests show we generally do remarkly well at OOM handling, but none the
> less we have *always* found bugs in our handling where we fail to report the
> OOM (as in, continue to run but with missing data), or we crash when cleaning
> up from OOM. Our unit tests don't cover anywhere near all of our codebase
> though. We have great coverage of XML parsing/formatting, and sporadic coverage
> of everything else.
>
> IOW, despite our clever API designs & coding patterns designed to improve our
> OOM handling, we can not have confidence that we will actually operate correctly
> in an OOM scenario.
I beg to disagree. The fact that our cleanup paths are in 99% the same
as error paths means that if there are no mem-leaks (and if
transformation to VIR_AUTOFREE is done the chances are low), then we
propably do good on OOM too.
<snip/>
> The implementation
> ==================
>
> Assuming a decision to abort on OOM, libvirt can nwo follow QEMU's lead and
> make use of the GLib library.
No, please no. Firstly, glib is a new C dialect that only a few of us
speak. Secondly, glib adds some padding around its objects => libvirt's
memory footprint will grow. Thirdly, it's yet another library to link
with (on my system libvirt links with 53 libraries already).
Michal
More information about the libvir-list
mailing list