[libvirt] [PATCH v2 1/4] xml: add disk driver metadata_cache_size option

John Ferlan jferlan at redhat.com
Tue Dec 11 16:34:29 UTC 2018



On 12/11/18 5:22 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 10.12.2018 19:56, John Ferlan wrote:
>>
>> On 11/8/18 8:02 AM, Nikolay Shirokovskiy wrote:
>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy at virtuozzo.com>
>>> ---
>>>  docs/formatdomain.html.in                          |  8 ++++
>>>  docs/schemas/domaincommon.rng                      | 11 +++++
>>>  src/conf/domain_conf.c                             | 17 ++++++++
>>>  src/conf/domain_conf.h                             |  9 ++++
>>>  .../qemuxml2argvdata/disk-metadata_cache_size.xml  | 42 +++++++++++++++++++
>>>  .../disk-metadata_cache_size.xml                   | 48 ++++++++++++++++++++++
>>>  tests/qemuxml2xmltest.c                            |  2 +
>>>  7 files changed, 137 insertions(+)
>>>  create mode 100644 tests/qemuxml2argvdata/disk-metadata_cache_size.xml
>>>  create mode 100644 tests/qemuxml2xmloutdata/disk-metadata_cache_size.xml
>>>
>>
>> <sigh> looks like a forgotten thread...  It seems reviewer bandwidth is
>> limited and won't get much better during the last couple weeks of the
>> month when many if not most Red Hat employees are out to due company
>> shutdown period.
>>
>> You need to add a few words to the commit message to describe what's
>> being changed.  Oh and you'll also need a "last" patch to docs/news.xml
>> to describe the new feature.
>>
>>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>>> index 269741a..1d186ab 100644
>>> --- a/docs/formatdomain.html.in
>>> +++ b/docs/formatdomain.html.in
>>> @@ -3556,6 +3556,14 @@
>>>              virt queues for virtio-blk. (<span class="since">Since 3.9.0</span>)
>>>            </li>
>>>            <li>
>>> +            The optional <code>metadata_cache_size</code> attribute specifies
>>> +            metadata cache size policy, possible values are "default" and "maximum".
>>
>> s/policy, possible/policy. Possible/
>>
>>> +            "default" leaves setting cache size to hypervisor, "maximum" makes
>>
>> s/"default"/Using "default"/
>>
>> s/to hypervisor,/to the hypervisor./
>>
>> s/"maximum"/Using "maximum"
>>
>>> +            cache size large enough to keep all metadata, this will help if workload
>>
>>
>> "Using maximum assigns the largest value possible for the cache size.
>> This ensures the entire disk cache remains in memory for faster I/O at
>> the expense of utilizing more memory."
>>
>> [Editorial comment: it's really not 100% clear what all the tradeoffs
>> someone is making here. The phrase "large enough" just sticks out, but
>> since you use INT64_MAX in patch 3, I suppose that *is* the maximum.
>> Still in some way indicating that this allows QEMU to grow the cache and
>> keep everything in memory, but has the side effect that disks configured
>> this way will cause guest memory requirements to grow albeit limited by
>> the QEMU algorithms. It seems from my read the max is 32MB, so perhaps
>> not a huge deal, but could be useful to note. Whether QEMU shrinks the
>> cache when not in use wasn't 100% clear to me.]
>>
>> It's too bad it's not possible to utilize some dynamic value via
>> qcow2_update_options_prepare. If dynamic adjustment were possible, then
>> saving the value in the XML wouldn't be necessary - we could just allow
>> dynamic adjustment similar to what I did for of a couple of IOThread
>> params (see commit 11ceedcda and the patches before it).
>>
>>> +            needs access to whole disk all the time. (<span class="since">Since
>>> +            4.10.0, QEMU 3.1</span>)
>>
>> This will be at least 5.0.0 now.
>>
>>> +          </li>
>>> +          <li>
>>>            For virtio disks,
>>>            <a href="#elementsVirtio">Virtio-specific options</a> can also be
>>>            set. (<span class="since">Since 3.5.0</span>)
>>> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
>>> index b9ac5df..3e406fc 100644
>>> --- a/docs/schemas/domaincommon.rng
>>> +++ b/docs/schemas/domaincommon.rng
>>> @@ -1990,6 +1990,9 @@
>>>          <ref name="detect_zeroes"/>
>>>        </optional>
>>>        <optional>
>>> +        <ref name="metadata_cache_size"/>
>>> +      </optional>
>>> +      <optional>
>>>          <attribute name='queues'>
>>>            <ref name="positiveInteger"/>
>>>          </attribute>
>>> @@ -2090,6 +2093,14 @@
>>>        </choice>
>>>      </attribute>
>>>    </define>
>>> +  <define name="metadata_cache_size">
>>> +    <attribute name='metadata_cache_size'>
>>> +      <choice>
>>> +        <value>default</value>
>>> +        <value>maximum</value>
>>
>> I didn't go back and read the previous reviews nor was I present for the
>> KVM Forum discussion on this, but default really means "minimum" I
>> think. Even that just doesn't feel "right". There's 3 QEMU knobs, but
>> this changes 1 knob allowing one direction. And yes, changing 3 knobs is
>> also confusing. I "assume" that it's been determined that this one knob
>> has the greatest affect on I/O performance...
>>
>> Reading https://github.com/qemu/qemu/blob/master/docs/qcow2-cache.txt
>> some phrases stick out:
>>
>>   1. "setting the right cache sizes is not a straightforward operation".
>>   2. "In order to choose the cache sizes we need to know how they relate
>> to the amount of allocated space."
>>   3. The limit of the l2 cache size is "32 MB by default on Linux
>> platforms (enough for full coverage of 256 GB images, with the default
>> cluster size)".
>>   4. "QEMU will not use more memory than needed to hold all of the
>> image's L2 tables, regardless of this max. value."
>>
>> So at least providing INT_MAX won't hurt, although given the math in
>> qcow2_update_options_prepare:
>>
>>     l2_cache_size /= l2_cache_entry_size;
>>     if (l2_cache_size < MIN_L2_CACHE_SIZE) {
>>         l2_cache_size = MIN_L2_CACHE_SIZE;
>>     }
>>     if (l2_cache_size > INT_MAX) {
>>
>> wouldn't that mean we could pass up to "l2_cache_size *
>> l2_cache_entry_size"?
> 
> You mean up to INT_MAX / l2_cache_entry_size? No, because there
> is limitation also that comes from read_cache_sizes so cache
> will be limited to maximum needs of disk size.
> 

So many dependencies... So much math to look at.

>>
>> Still using minimum/maximum for what amounts to a boolean operation is I
>> think unnecessary. Rather than a list of values, wouldn't having
>> something like "max_metadata_cache='yes'" be just as effective?
> 
> Well introducing metadata_cache_size enum is what it was agreed on KVM Forum...
> 

But yet providing maximum runs into immediate issues because of bugs
that are fixed in 3.1.

Devil's advocate says - why not patch QEMU to add something that causes
the allocation algorithm to use the maximum values for great I/O
performance - that way mgmt apps don't have to figure out the math and
if the QEMU algorithm for this ever changes and guest XML settings don't
have a negative impact. The mgmt app just has to signify it's desire to
use algorithms that "can" improve I/O performance significantly.

>>
>> Alternatively, reading through the above document and thinking about how
>> "useful" it could be to use a more specific value, why not go with
>> "metadata_cache_size=N", where N is in MB and describes the size of the
>> cache to be used. Being in MB ascribes to the need to be a power of 2
>> and quite frankly keeps a reasonable range assigned to the value. I'd
>> find it hard to fathom a "large enough" disk would improve that much I/O
>> performance in KB adjustments, but I haven't done any analysis of that
>> theory either.
> 
> There a lot of discussion of this cache parameter in this list, bugzilla
> (sorry for not providing links) that boils down to one does not know
> how to set it properly, so having option that is in MB is hard to use. On

I know there's lots of discussion on this - you did list a few series in
your first cover. Using MB was merely a reaction to reading the document
and seeing it says "The maximum L2 cache size is 32 MB by default on
Linux platforms". So a value between 1 to 32 could be used if someone
wants to go through the pain of figuring out the nuances of the
algorithm and difference in performance in their environment for each
step. Not providing a "set" number though satisfies the desire to avoid
some number being baked into the guest XML permanently.

> the other hand there are cases (synthetic tests, large database) when default
> cache size is clearly not big enough so in course of discussion
> it was decided to have besides default cache size also cache size that
> cover whole disk so IO performance will not degrade whatever guest will do.

I understand the need... Many previous series have shown that!

> Thus this 'maximum' set. 'default' is just an explicit default (implicit
> is omitted metadata_cache_size option of course).

The chosen implementation is boolean - all or default. Was there
discussion that also noted that perhaps we should add a 3rd or 4th enum
or string to describe some condition between default and maximum
("quarter", "half", "threefourths")?

In the long run it doesn't matter, but as the implementation is posted
it is a boolean operation. I can live with "minimum" and "maximum", but
feel it's important to note the obvious especially since trying to
describe anything between could take the PHD (piled higher and deeper)
of doc writing abilities.

> 
> Also having enum instead of bool is a bit future proof )

In my mind only if the choice is at some time in the future to add
something between default and maximum. I honestly don't see that
happening, but if the group decision at KVM Forum was to go with some
sort of enum - I can live with that. Perhaps something was discussed
there that just isn't obvious with my short term tunnel vision.

John

> 
> Nikolay
> 
>>
>> I think the above document also gives a "fair" example that could be
>> "reworded" into the libvirt docs in order to help "size" things based on
>> the default QEMU values we're not allowing to be changed and then
>> showing the math what it means to have a 1MB cache, a 2MB cache, an 8MB,
>> etc. vis-a-vis the size of the disk for which the cache is being
>> created. For example, for a disk of up to 8G, the "default" cache would
>> be XXX and setting a value of YYY would help I/O.  Similarly for a 16GB,
>> 64GB, etc. Essentially a bit of a guide - although that's starting to
>> border on what should go into the wiki instead.
>>
>>> +      </choice>
>>> +    </attribute>
>>> +  </define>
>>>    <define name="controller">
>>>      <element name="controller">
>>>        <optional>
>>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>>> index 237540b..b2185f8 100644
>>> --- a/src/conf/domain_conf.c
>>> +++ b/src/conf/domain_conf.c
>>> @@ -885,6 +885,11 @@ VIR_ENUM_IMPL(virDomainDiskDetectZeroes, VIR_DOMAIN_DISK_DETECT_ZEROES_LAST,
>>>                "on",
>>>                "unmap")
>>>  
>>> +VIR_ENUM_IMPL(virDomainDiskMetadataCacheSize,
>>> +              VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_LAST,
>>> +              "default",
>>> +              "maximum")
>>> +
>>>  VIR_ENUM_IMPL(virDomainDiskMirrorState, VIR_DOMAIN_DISK_MIRROR_STATE_LAST,
>>>                "none",
>>>                "yes",
>>> @@ -9375,6 +9380,14 @@ virDomainDiskDefDriverParseXML(virDomainDiskDefPtr def,
>>>      }
>>>      VIR_FREE(tmp);
>>>  
>>> +    if ((tmp = virXMLPropString(cur, "metadata_cache_size")) &&
>>> +        (def->metadata_cache_size = virDomainDiskMetadataCacheSizeTypeFromString(tmp)) < 0) {
>>> +        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
>>> +                       _("unknown driver metadata_cache_size value '%s'"), tmp);
>>> +        goto cleanup;
>>> +    }
>>> +    VIR_FREE(tmp);
>>> +
>>
>> This would just use the YesNo type logic instead or read in a numeric
>> value.  BTW: Not sure I'd bother with worrying about checking the QEMU
>> maximum as that could change and then we're stuck with a lower maximum
>> check. Just pass along the value to QEMU and let it fail.
>>
>>>      ret = 0;
>>>  
>>>   cleanup:
>>> @@ -23902,6 +23915,10 @@ virDomainDiskDefFormatDriver(virBufferPtr buf,
>>>      if (disk->queues)
>>>          virBufferAsprintf(&driverBuf, " queues='%u'", disk->queues);
>>>  
>>> +    if (disk->metadata_cache_size)
>>> +        virBufferAsprintf(&driverBuf, " metadata_cache_size='%s'",
>>> +                          virDomainDiskMetadataCacheSizeTypeToString(disk->metadata_cache_size));
>>> +
>>
>> My personal crusade... metadata_cache_size >
>> VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_DEFAULT (yes, > 0).
>>
>> Of course this all depends on your final solution - a boolean would only
>> be written if set and an int value only written if > 0 (since 0 would be
>> the "optional" viewpoint).
>>
>>>      virDomainVirtioOptionsFormat(&driverBuf, disk->virtio);
>>>  
>>>      ret = virXMLFormatElement(buf, "driver", &driverBuf, NULL);
>>> diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
>>> index e30a4b2..b155058 100644
>>> --- a/src/conf/domain_conf.h
>>> +++ b/src/conf/domain_conf.h
>>> @@ -568,6 +568,13 @@ typedef enum {
>>>      VIR_DOMAIN_DISK_DETECT_ZEROES_LAST
>>>  } virDomainDiskDetectZeroes;
>>>  
>>> +typedef enum {
>>> +    VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_DEFAULT = 0,
>>> +    VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_MAXIMUM,
>>> +
>>> +    VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_LAST
>>> +} virDomainDiskMetadataCacheSize;
>>> +
>>>  typedef struct _virDomainBlockIoTuneInfo virDomainBlockIoTuneInfo;
>>>  struct _virDomainBlockIoTuneInfo {
>>>      unsigned long long total_bytes_sec;
>>> @@ -672,6 +679,7 @@ struct _virDomainDiskDef {
>>>      int discard; /* enum virDomainDiskDiscard */
>>>      unsigned int iothread; /* unused = 0, > 0 specific thread # */
>>>      int detect_zeroes; /* enum virDomainDiskDetectZeroes */
>>> +    int metadata_cache_size; /* enum virDomainDiskMetadataCacheSize */
>>>      char *domain_name; /* backend domain name */
>>>      unsigned int queues;
>>>      virDomainVirtioOptionsPtr virtio;
>>> @@ -3388,6 +3396,7 @@ VIR_ENUM_DECL(virDomainDeviceSGIO)
>>>  VIR_ENUM_DECL(virDomainDiskTray)
>>>  VIR_ENUM_DECL(virDomainDiskDiscard)
>>>  VIR_ENUM_DECL(virDomainDiskDetectZeroes)
>>> +VIR_ENUM_DECL(virDomainDiskMetadataCacheSize)
>>>  VIR_ENUM_DECL(virDomainDiskMirrorState)
>>>  VIR_ENUM_DECL(virDomainController)
>>>  VIR_ENUM_DECL(virDomainControllerModelPCI)
>>> diff --git a/tests/qemuxml2argvdata/disk-metadata_cache_size.xml b/tests/qemuxml2argvdata/disk-metadata_cache_size.xml
>>> new file mode 100644
>>> index 0000000..8ac2599
>>> --- /dev/null
>>> +++ b/tests/qemuxml2argvdata/disk-metadata_cache_size.xml
>>> @@ -0,0 +1,42 @@
>>> +<domain type='qemu'>
>>> +  <name>test</name>
>>> +  <uuid>468404ad-d49c-40f2-9e14-02294f9c1be3</uuid>
>>> +  <memory unit='KiB'>1048576</memory>
>>> +  <currentMemory unit='KiB'>1048576</currentMemory>
>>> +  <vcpu placement='static'>1</vcpu>
>>> +  <os>
>>> +    <type arch='x86_64' machine='pc-0.13'>hvm</type>
>>
>> Nothing more recent than pc-0.13 ;-) for this copy-pasta.
>>
>>> +    <boot dev='cdrom'/>
>>> +    <boot dev='hd'/>
>>> +    <bootmenu enable='yes'/>
>>> +  </os>
>>> +  <clock offset='utc'/>
>>> +  <on_poweroff>destroy</on_poweroff>
>>> +  <on_reboot>restart</on_reboot>
>>> +  <on_crash>restart</on_crash>
>>> +  <devices>
>>> +    <emulator>/usr/bin/qemu-system-x86_64</emulator>
>>> +    <disk type='file' device='disk'>
>>> +      <driver name='qemu' type='qcow2' metadata_cache_size='maximum'/>
>>> +      <source file='/var/lib/libvirt/images/f14.img'/>
>>> +      <target dev='vda' bus='virtio'/>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>> +    </disk>
>>> +    <disk type='file' device='cdrom'>
>>> +      <driver name='qemu' type='raw'/>
>>> +      <source file='/var/lib/libvirt/Fedora-14-x86_64-Live-KDE.iso'/>
>>> +      <target dev='hdc' bus='ide'/>
>>> +      <readonly/>
>>> +      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
>>> +    </disk>
>>
>> This second disk isn't necessary
>>
>> John
>>
>>> +    <controller type='usb' index='0'/>
>>> +    <controller type='virtio-serial' index='0'>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
>>> +    </controller>
>>> +    <controller type='ide' index='0'/>
>>> +    <controller type='pci' index='0' model='pci-root'/>
>>> +    <input type='mouse' bus='ps2'/>
>>> +    <input type='keyboard' bus='ps2'/>
>>> +    <memballoon model='virtio'/>
>>> +  </devices>
>>> +</domain>
>>> diff --git a/tests/qemuxml2xmloutdata/disk-metadata_cache_size.xml b/tests/qemuxml2xmloutdata/disk-metadata_cache_size.xml
>>> new file mode 100644
>>> index 0000000..5fed22b
>>> --- /dev/null
>>> +++ b/tests/qemuxml2xmloutdata/disk-metadata_cache_size.xml
>>> @@ -0,0 +1,48 @@
>>> +<domain type='qemu'>
>>> +  <name>test</name>
>>> +  <uuid>468404ad-d49c-40f2-9e14-02294f9c1be3</uuid>
>>> +  <memory unit='KiB'>1048576</memory>
>>> +  <currentMemory unit='KiB'>1048576</currentMemory>
>>> +  <vcpu placement='static'>1</vcpu>
>>> +  <os>
>>> +    <type arch='x86_64' machine='pc-0.13'>hvm</type>
>>> +    <boot dev='cdrom'/>
>>> +    <boot dev='hd'/>
>>> +    <bootmenu enable='yes'/>
>>> +  </os>
>>> +  <clock offset='utc'/>
>>> +  <on_poweroff>destroy</on_poweroff>
>>> +  <on_reboot>restart</on_reboot>
>>> +  <on_crash>restart</on_crash>
>>> +  <devices>
>>> +    <emulator>/usr/bin/qemu-system-x86_64</emulator>
>>> +    <disk type='file' device='disk'>
>>> +      <driver name='qemu' type='qcow2' metadata_cache_size='maximum'/>
>>> +      <source file='/var/lib/libvirt/images/f14.img'/>
>>> +      <target dev='vda' bus='virtio'/>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>>> +    </disk>
>>> +    <disk type='file' device='cdrom'>
>>> +      <driver name='qemu' type='raw'/>
>>> +      <source file='/var/lib/libvirt/Fedora-14-x86_64-Live-KDE.iso'/>
>>> +      <target dev='hdc' bus='ide'/>
>>> +      <readonly/>
>>> +      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
>>> +    </disk>
>>> +    <controller type='usb' index='0'>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
>>> +    </controller>
>>> +    <controller type='virtio-serial' index='0'>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
>>> +    </controller>
>>> +    <controller type='ide' index='0'>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
>>> +    </controller>
>>> +    <controller type='pci' index='0' model='pci-root'/>
>>> +    <input type='mouse' bus='ps2'/>
>>> +    <input type='keyboard' bus='ps2'/>
>>> +    <memballoon model='virtio'>
>>> +      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
>>> +    </memballoon>
>>> +  </devices>
>>> +</domain>
>>> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
>>> index 89640f6..c44e0fe 100644
>>> --- a/tests/qemuxml2xmltest.c
>>> +++ b/tests/qemuxml2xmltest.c
>>> @@ -1235,6 +1235,8 @@ mymain(void)
>>>      DO_TEST("riscv64-virt",
>>>              QEMU_CAPS_DEVICE_VIRTIO_MMIO);
>>>  
>>> +    DO_TEST("disk-metadata_cache_size", NONE);
>>> +
>>>      if (getenv("LIBVIRT_SKIP_CLEANUP") == NULL)
>>>          virFileDeleteTree(fakerootdir);
>>>  
>>>




More information about the libvir-list mailing list