[Libguestfs] Memory corruption when testing nbdkit python plugin with nbd-tester-client?

Tue Sep 27 08:26:18 UTC 2016

On 26.09.2016 20:40, Carl-Daniel Hailfinger wrote:
> On 26.09.2016 19:20, Carl-Daniel Hailfinger wrote:
>> On 26.09.2016 14:29, Richard W.M. Jones wrote:
>>> On Mon, Sep 26, 2016 at 02:18:02PM +0200, Carl-Daniel Hailfinger wrote:
>>>> Hi,
>>>>
>>>> has anyone ever run "make check" from nbd against nbdkit with a python
>>>> plugin? I usually get segfaults during such a run, and sometimes various
>>>> other errors happen before the segfault, suggesting that some memory
>>>> corruption is underway.
>>>> AFAICS a pure python plugin should not be able to cause memory corruption.
>>> Correct, a python plugin should not cause memory corruption,
>>> and nbdkit shouldn't segfault ever.
>>>
>>> Did you get a stack trace from C (not from Python)?
>> The core files were useless, but I ran nbdkit in gdb and got something...
> And another one... my close function pickles all python objects which
> were accessed in the plugin. Maybe that's what exposes the memory
> corruption on close.
> [...]
> Will try instrumenting the close function next.

The results are in.

I dumped the reference counters for various python objects inside
py_close and often the reference counters are off-by-one or off-by-two
in the crashing case.

AFAICS the python API says that once you have threads, you have to do
the GIL dance:
https://docs.python.org/2/c-api/init.html#thread-state-and-the-global-interpreter-lock
. All the crashes I saw can be explained by unsynchronized accesses to
python objects.

I found the bug.
plugin_lock_request() is used to lock client requests only, but it does
_not_ lock any other calls into the plugin. Due to that, the following
calls are neither locked against each other nor are they locked against
client requests:
plugin_register (const char *_filename,
plugin_cleanup (void)
plugin_config (const char *key, const char *value)
plugin_config_complete (void)
plugin_open (struct connection *conn, int readonly)
plugin_close (struct connection *conn)
plugin_get_size (struct connection *conn)
plugin_can_write (struct connection *conn)
plugin_can_flush (struct connection *conn)
plugin_is_rotational (struct connection *conn)
plugin_can_trim (struct connection *conn)

That means dumping the plugin-internal data on connection close is not
locked against the requests within that connection, and any garbage
collection triggered by the last request may still be running while
plugin_close runs. Boom.

In general, when working with python any call into python should be
wrapped like this:

PyGILState_STATE gstate;
gstate = PyGILState_Ensure();

/* Perform Python actions here. */
result = CallSomeFunction();
/* evaluate result or handle exception */

/* Release the thread. No Python API allowed beyond this point. */
PyGILState_Release(gstate);

That would even allow us to run everything in parallel.

I have now added a call to plugin_lock_request() within plugin_close()
and will check if it still explodes. That's obviously less than what the
python developers recommend, but it may at least reduce the likelihood
of the crashes.

Regards,
Carl-Daniel