[Libguestfs] [nbdkit PATCH 3/5] api: Add nbdkit_string_intern helper

Eric Blake eblake at redhat.com
Tue Aug 25 20:32:26 UTC 2020


On 8/25/20 2:21 PM, Richard W.M. Jones wrote:
> On Tue, Aug 25, 2020 at 10:46:57AM -0500, Eric Blake wrote:
>> + const char *nbdkit_string_intern (const char *str);
>> +
>> +Returns a copy of str, so that the caller may reclaim str and use the
>> +copy in its place.  If the copy is created outside the scope of a
>> +connection (such as during C<.load>), the lifetime of the copy will
>> +last at least through C<.unload>; if it was created within a
>> +connection (such as during C<.preconnect>, C<.default_export>, or
>> +C<.open>), the lifetime will last at least through C<.close>.
> 
> Interesting - this isn't exactly how I imagined this would work.
> Would it make sense to intern strings for the lifetime of the server
> no matter where they are allocated from?

If you allocate 100 bytes for every connection, but no longer reference 
the memory after .close, then intern'ing for the lifetime of the server 
adds up to a megabyte of usage after 10,000 guests even if most of 
those clients are no longer connected, while intern'ing just for the 
lifetime of the connection does not.  It's actually a pretty decent 
heuristic (and thus I didn't give the API a knob): if you are 
allocating a string in response to .load, .config, or another callback 
outside of a connection context, you probably want that string to last 
until .unload; if you are allocating a string in response to 
.list_exports, .default_export, .open, or anything else tied to a 
given client connection, you probably don't need that string any longer 
than the .close of that same client, except maybe for the first client 
(but even then, you could use .get_ready to do that sort of global setup 
without waiting for the first connection).
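
To make the heuristic concrete, here's roughly how a plugin would 
consume it.  This is only a sketch against the API as proposed; the 
"dir" parameter, struct handle, and label string are made up for 
illustration, and it assumes the usual <nbdkit-plugin.h> plumbing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *dir;       /* interned in .config: valid until .unload */

static int
example_config (const char *key, const char *value)
{
  if (strcmp (key, "dir") == 0) {
    dir = nbdkit_string_intern (value);   /* outside any connection */
    return dir ? 0 : -1;
  }
  nbdkit_error ("unknown parameter %s", key);
  return -1;
}

struct handle {
  const char *label;          /* interned in .open: valid until .close */
};

static void *
example_open (int readonly)
{
  struct handle *h = calloc (1, sizeof *h);
  char label[64];

  if (h == NULL) {
    nbdkit_error ("calloc: %m");
    return NULL;
  }
  snprintf (label, sizeof label, "%s (%s)",
            dir ? dir : ".", readonly ? "ro" : "rw");
  /* Inside a connection: nbdkit reclaims the copy after .close, so
   * the plugin's .close only has to free the handle itself.
   */
  h->label = nbdkit_string_intern (label);
  if (h->label == NULL) {
    free (h);
    return NULL;
  }
  return h;
}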

> 
> Another suggestion would be for the function to take a flag indicating
> preferred lifetime, but that presents additional complications for the
> caller.

Yeah, I did think about a flag, and decided it wasn't worth it.  When 
called outside the context of a connection, the API would fail if you 
request in-connection lifetime.  When called inside the context of a 
connection, requesting an out-of-connection lifetime gets us right back 
to that appearance of a memory leak if it is repeated for every 
connection (even though we aren't actually leaking anything, and 
valgrind is happy, the constant growth of memory usage as more clients 
come and go is annoying); and if it is not repeated for every 
connection, it is just as easy to do the initialization during 
.get_ready instead of needing a knob.

> 
> Also I imagined that this function would be a true (LISP-style)
> INTERN, in that it would return a single pointer if called twice with
> the two equal strings.  But I see from the implementation later on
> that this isn't the case.

It isn't now, but the contract doesn't prevent it from doing so in the 
future.  Callers are only promised that the lifetime is at least as long 
as .close, but nothing would stop us from making it longer if that made 
our implementation use memory more compactly.  Right now, though, I 
don't even bother hashing or searching for duplicates; I always append 
at the end.  Performance-wise, that may mean a bit more memory usage 
and/or cleanup time, but so far there are few enough uses that the main 
benefit (a lifetime guaranteed by nbdkit, with automatic cleanup and no 
bookkeeping in the plugin) seemed to outweigh the need for more 
complexity.
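
For reference, the append-only approach can be as small as the 
following sketch (illustrative only, not the actual nbdkit internals); 
one such list would exist per connection plus one for the global 
context, each torn down wholesale at .close or .unload:

#include <stdlib.h>
#include <string.h>

struct intern_list {
  char **strings;
  size_t len, cap;
};

/* Copy str into the list; the copy lives until the list is freed. */
static const char *
intern (struct intern_list *list, const char *str)
{
  char *copy;

  if (list->len == list->cap) {
    size_t newcap = list->cap ? list->cap * 2 : 16;
    char **p = realloc (list->strings, newcap * sizeof *p);
    if (p == NULL)
      return NULL;
    list->strings = p;
    list->cap = newcap;
  }
  copy = strdup (str);
  if (copy == NULL)
    return NULL;
  list->strings[list->len++] = copy;
  return copy;
}

/* Called once at .close (per-connection list) or .unload (global list). */
static void
intern_list_free (struct intern_list *list)
{
  for (size_t i = 0; i < list->len; i++)
    free (list->strings[i]);
  free (list->strings);
  list->strings = NULL;
  list->len = list->cap = 0;
}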

> 
>> -      keys[optind] = strndup (argv[optind], n);
>> -      if (keys[optind] == NULL) {
>> -        perror ("strndup");
>> +      CLEANUP_FREE char *key = strndup (argv[optind], n);
>> +      const char *safekey = nbdkit_string_intern (key);
>> +      if (safekey == NULL)
>>           exit (EXIT_FAILURE);
> 
> It seems it might also be nice to have a "strndup" version of the
> intern function.

Or even a memdup.  Yeah, those are possibilities as well.  Doing all 
three (strdup, strndup, memdup) up front isn't hard.  If API 
proliferation is a problem, you can combine strdup and strndup by 
accepting -1 as a length to mean stop at NUL termination; but I'm less 
worried about API proliferation and more about ease-of-use.
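
In the meantime, a plugin wanting the strndup flavor can fake it with a 
bounded temporary copy on top of the base call; a quick sketch (the 
helper name is made up, and it assumes <nbdkit-plugin.h> for 
nbdkit_error):

#include <stdlib.h>
#include <string.h>

/* Intern only the first n bytes of str, built on the proposed
 * nbdkit_string_intern; not a committed nbdkit API.
 */
static const char *
intern_n (const char *str, size_t n)
{
  char *tmp = strndup (str, n);   /* bounded temporary copy */
  const char *copy;

  if (tmp == NULL) {
    nbdkit_error ("strndup: %m");
    return NULL;
  }
  copy = nbdkit_string_intern (tmp);
  free (tmp);                     /* safe: the interned copy outlives tmp */
  return copy;
}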

> 
> I'm not completely sure which of these behaviours is better.  Ideally
> back when I started writing nbdkit I would have used a pool allocator,
> but that's all water under the bridge now.  So none of the above
> indicates I have a preference for a particular implementation, just
> ideas for further discussion.

The nice part of the API as proposed is that a pool allocator for 
intern'd strings is still viable!  We have to be careful that a realloc 
doesn't invalidate prior return values, but nothing requires us to do a 
fresh malloc() per intern rather than just carving out a slice of a 
larger pool when room remains, and only malloc'ing a large chunk (maybe 
64k) when it is not; cleanup at the end of a connection or server 
lifetime then only has to free the chunks instead of the individual 
intern'd strings within them.  In contrast, an API where the caller 
transfers ownership of a malloc()'d string into the server, and the 
server becomes responsible for calling free(), is much more constrained 
(we can't change that interface without breaking existing clients). 
Thus, avoiding a hard dependence on malloc() in our ABI is a good thing.
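
A chunk-based pool along those lines might look like this sketch (again 
illustrative, not actual nbdkit code): hand out slices from 64k chunks, 
and at .close or .unload free whole chunks rather than individual 
strings.

#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (64 * 1024)

struct chunk {
  struct chunk *next;
  size_t used;
  char data[CHUNK_SIZE];
};

struct pool {
  struct chunk *head;            /* most recently allocated chunk */
};

/* Copy str (including its NUL) into the pool; the result is stable
 * because a chunk is never realloc'd after pointers into it escape.
 */
static const char *
pool_intern (struct pool *p, const char *str)
{
  size_t len = strlen (str) + 1;
  struct chunk *c = p->head;

  if (len > CHUNK_SIZE)
    return NULL;                 /* oversized strings need a fallback path */
  if (c == NULL || CHUNK_SIZE - c->used < len) {
    c = malloc (sizeof *c);
    if (c == NULL)
      return NULL;
    c->used = 0;
    c->next = p->head;
    p->head = c;
  }
  memcpy (c->data + c->used, str, len);
  c->used += len;
  return c->data + c->used - len;
}

/* Free whole chunks, not the individual intern'd strings. */
static void
pool_free (struct pool *p)
{
  while (p->head) {
    struct chunk *next = p->head->next;
    free (p->head);
    p->head = next;
  }
}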

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



