[libvirt] [Qemu-devel] IO accounting overhaul

Benoît Canet benoit.canet at irqsave.net
Mon Sep 1 13:38:12 UTC 2014


On Monday 01 Sep 2014 at 13:41:01 (+0200), Markus Armbruster wrote:
> Benoît Canet <benoit.canet at irqsave.net> writes:
> 
> > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> >> Cc'ing libvirt following Stefan's lead.
> >> 
> >> Benoît Canet <benoit.canet at irqsave.net> writes:
> >> 
> >> > Hi,
> >> >
> >> > I collected some items of a cloud provider wishlist regarding I/O accounting.
> >> 
> >> Feedback from real power-users, lovely!
> >> 
> >> > In a cloud, I/O accounting can serve three purposes: billing, helping the
> >> > customers, and doing metrology to help the cloud provider seek out hidden costs.
> >> >
> >> > I'll cover the first two topics in this mail because they are the most
> >> > important business-wise.
> >> >
> >> > 1) preferred place to collect billing I/O accounting data:
> >> > -----------------------------------------------------------
> >> > For billing purposes the collected data must be as close as possible to
> >> > what the customer would see by using iostat in his VM.
> >> 
> >> Good point.
> >> 
> >> > The first conclusion we can draw is that the choice of collecting I/O
> >> > accounting data used for billing in the block device models is right.
> >> 
> >> Slightly rephrasing: doing I/O accounting in the block device models is
> >> right for billing.
> >> 
> >> There may be other uses for I/O accounting, with different preferences.
> >> For instance, data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph could be useful.
> >
> > I think this is the third point, the one I named metrology.
> > Basically it boils down to "Where are the hidden I/O costs of the QEMU
> > block layer?".
> 
> Understood.
> 
> >> Doesn't diminish the need for accurate billing information, of course.
> >> 
> >> > 2) what to do with occurrences of rare events:
> >> > ---------------------------------------------
> >> >
> >> > Another point is that QEMU developers agree that they don't know which
> >> > policy to apply to some I/O accounting events.
> >> > Must QEMU discard invalid write I/O or account it as done?
> >> > Must QEMU count a failed read I/O as done?
> >> >
> >> > When discussing this with a cloud provider the following appeared: these
> >> > decisions are really specific to each cloud provider and QEMU should not
> >> > implement them.
> >> 
> >> Good point, consistent with the old advice to avoid baking policy into
> >> inappropriately low levels of the stack.
> >> 
> >> > The right thing to do is to add accounting counters to collect these events.
> >> >
> >> > Moreover, these rare events are precious troubleshooting data, so that's an
> >> > additional reason not to toss them.
> >> 
> >> Another good point.
> >> 
> >> > 3) list of block I/O accounting metrics wished for billing and helping
> >> > the customers
> >> > -----------------------------------------------------------------------
> >> >
> >> > Basic I/O accounting data will end up making the customers' bills.
> >> > Extra I/O accounting information would be a precious help for the cloud
> >> > provider to implement a monitoring panel like Amazon CloudWatch.
> >> 
> >> These are the first two from your list of three purposes, i.e. the ones
> >> you promised to cover here.
> >> 
> >> > Here is the list of counters and statistics I would like to help implement
> >> > in QEMU.
> >> >
> >> > This is the most important part of the mail and the one I would like the
> >> > community to review the most.
> >> >
> >> > Once this list is settled I would proceed to implement the required
> >> > infrastructure in QEMU before using it in the device models.
> >> 
> >> For context, let me recap how I/O accounting works now.
> >> 
> >> The BlockDriverState abstract data type (short: BDS) can hold the
> >> following accounting data:
> >> 
> >>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> >>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
> >>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> >>     uint64_t wr_highest_sector;
> >> 
> >> where BDRV_MAX_IOTYPE enumerates read, write, flush.
> >> 
> >> wr_highest_sector is a high watermark updated by the block layer as it
> >> writes sectors.
> >> 
> >> The other three are *not* touched by the block layer.  Instead, the
> >> block layer provides a pair of functions for device models to update
> >> them:
> >> 
> >>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> >>             int64_t bytes, enum BlockAcctType type);
> >>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
> >> 
> >> bdrv_acct_start() initializes cookie for a read, write, or flush
> >> operation of a certain size.  The size of a flush is always zero.
> >> 
> >> bdrv_acct_done() adds the operations to the BDS's accounting data.
> >> total_time_ns is incremented by the time between _start() and _done().
> >> 
> >> You may call _start() without calling _done().  That's a feature.
> >> Device models use it to avoid accounting some requests.
> >> 
> >> Device models are not supposed to mess with cookie directly, only
> >> through these two functions.
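> >> 
> >> For illustration, a minimal sketch of the expected usage in a device model
> >> (the request struct, fields and function names are made up; only the
> >> bdrv_acct_*() calls are the real API):
> >> 
> >>     typedef struct MyRequest {
> >>         BlockDriverState *bs;
> >>         BlockAcctCookie acct;   /* opaque to the device model */
> >>         QEMUIOVector qiov;
> >>         int64_t sector;
> >>     } MyRequest;
> >> 
> >>     static void my_read_complete(void *opaque, int ret)
> >>     {
> >>         MyRequest *req = opaque;
> >> 
> >>         if (ret == 0) {
> >>             /* success: add the request to the BDS's accounting data */
> >>             bdrv_acct_done(req->bs, &req->acct);
> >>         }
> >>         /* on failure, simply not calling _done() leaves the request
> >>          * out of the stats -- the "feature" mentioned above */
> >>     }
> >> 
> >>     static void my_read(MyRequest *req, int nb_sectors)
> >>     {
> >>         bdrv_acct_start(req->bs, &req->acct,
> >>                         nb_sectors * BDRV_SECTOR_SIZE, BDRV_ACCT_READ);
> >>         bdrv_aio_readv(req->bs, req->sector, &req->qiov, nb_sectors,
> >>                        my_read_complete, req);
> >>     }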
> >> 
> >> Some device models implement accounting, some don't.  The ones that do
> >> don't agree on how to count invalid guest requests (the ones not passed
> >> to block layer) and failed requests (passed to block layer and failed
> >> there).  It's a mess in part caused by us never writing down what
> >> exactly device models are expected to do.
> >> 
> >> Accounting data is used by "query-blockstats", and nothing else.
> >> 
> >> Corollary: even though every BDS holds accounting data, only the ones in
> >> "top" BDSes ever get used.  This is a common block layer blemish, and
> >> we're working on cleaning it up.
> >> 
> >> If a device model doesn't implement accounting, query-blockstats lies.
> >> Fortunately, its lies are pretty transparent (everything's zero) as long
> >> as you don't do things like connecting a backend to a device model that
> >> doesn't implement accounting after disconnecting it from a device model
> >> that does.  Still, I'd welcome a more honest QMP interface.
> >> 
> >> For me, this accounting data belongs to the device model, not the BDS.
> >> Naturally, the block device models should use common infrastructure.  I
> >> guess they use the block layer only because it's obvious infrastructure
> >> they share.  Clumsy design.
> >> 
> >> > /* volume of data transferred by the I/Os */
> >> > read_bytes
> >> > write_bytes
> >> 
> >> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
> >> 
> >> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
> >> 
> >> Should this count only actual I/O, i.e. accumulated size of successful
> >> operations?
> >> 
> >> > /* operation count */
> >> > read_ios
> >> > write_ios
> >> > flush_ios
> >> >
> >> > /* how many invalid I/Os the guest submitted */
> >> > invalid_read_ios
> >> > invalid_write_ios
> >> > invalid_flush_ios
> >> >
> >> > /* how many I/O errors happened */
> >> > read_ios_error
> >> > write_ios_error
> >> > flush_ios_error
> >> 
> >> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> >> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
> >> 
> >> > /* account the time spent doing I/Os */
> >> > total_read_time
> >> > total_write_time
> >> > total_flush_time
> >> 
> >> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> >> total_time_ns[BDRV_ACCT_FLUSH].
> >> 
> >> I guess this should count both successful and failed I/O.  Could throw
> >> in invalid, too, but it's probably too quick to matter.
> >> 
> >> Please specify the unit clearly.  Both total_FOO_time_ns or total_FOO_ns
> >> would work for me.
> >
> > Yes _ns is fine for me too.
> >
> >> 
> >> > /* since when the volume is iddle */
> >> > qvolume_iddleness_time
> >> 
> >> "idle"
> >> 
> >> The obvious way to maintain this information with the current code would
> >> be saving the value of get_clock() in bdrv_acct_done().
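> >> 
> >> Something along these lines (a sketch; the field name is invented):
> >> 
> >>     /* in the accounting data */
> >>     int64_t last_io_time_ns;
> >> 
> >>     /* in bdrv_acct_done() */
> >>     bs->last_io_time_ns = get_clock();
> >> 
> >>     /* idleness computed on demand, e.g. when query-blockstats runs */
> >>     int64_t idle_ns = get_clock() - bs->last_io_time_ns;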
> >> 
> >> > /* the following would compute latencies for slices of 1 second, then toss the
> >> >  * result and start a new slice. A weighted summation of the instant latencies
> >> >  * could help implement this.
> >> >  */
> >> > 1s_read_average_latency
> >> > 1s_write_average_latency
> >> > 1s_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a 1
> >> > minute slice value */
> >> > 1m_read_average_latency
> >> > 1m_write_average_latency
> >> > 1m_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a 1 hour
> >> > slice value */
> >> > 1h_read_average_latency
> >> > 1h_write_average_latency
> >> > 1h_flush_average_latency
> >> 
> >> This is something like "what we added to total_FOO_time in the last
> >> completed 1s / 1m / 1h time slice divided by the number of additions".
> >> Just another way to accumulate the same raw data, thus no worries.
> >> 
> >> > /* 1 second average number of requests in flight */
> >> > 1s_read_queue_depth
> >> > 1s_write_queue_depth
> >> >
> >> > /* 1 minute average number of requests in flight */
> >> > 1m_read_queue_depth
> >> > 1m_write_queue_depth
> >> >
> >> > /* 1 hour average number of requests in flight */
> >> > 1h_read_queue_depth
> >> > 1h_write_queue_depth
> >> 
> >> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> >> The "you need not call bdrv_acct_done()" feature may get in the way.
> >> Solvable.
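> >> 
> >> A sketch of what "solvable" could look like (field names made up): keep a
> >> per-type in-flight counter, bump it in bdrv_acct_start(), drop it in
> >> bdrv_acct_done() and its failure/invalid counterparts, and sample it when
> >> closing a slice.
> >> 
> >>     /* in the accounting data */
> >>     unsigned in_flight[BDRV_MAX_IOTYPE];
> >> 
> >>     /* in bdrv_acct_start() */
> >>     bs->in_flight[type]++;
> >> 
> >>     /* in bdrv_acct_done() and friends */
> >>     assert(bs->in_flight[cookie->type] > 0);
> >>     bs->in_flight[cookie->type]--;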
> >> 
> >> Permit me a short detour into the other use for I/O accounting I
> >> mentioned: data on how exactly guest I/O gets translated to host I/O as
> >> it flows through the nodes in the block graph.  Do you think this would
> >> be pretty much the same data, just collected at different points?
> >
> > That's something I would like to take care of in a further sub-project.
> 
> I'm asking now because it impacts where I'd prefer the shared
> infrastructure to live.
> 
> > Optionally collecting the same data for each BDS of the graph.
> 
> If that's the case, keeping the shared infrastructure in the block layer
> makes sense.
> 
> BDS member acct then holds I/O stats for the BDS.  We currently use it
> for something else: I/O stats of the device model backed by this BDS.
> That needs to move elsewhere.  Two places come to mind:
> 
> 1. BlockBackend, when it's available (I resumed working on it last week
>    for a bit).  Superficially attractive, because it's close to what we
>    have now, but then we have to deal with what to do when the backend
>    gets disconnected from its device model, then connected to another
>    one.
> 
> 2. The device models that actually implement I/O accounting.  Since
>    query-blockstats names a backend rather than a device model, we need
>    a BlockDevOps callback to fetch the stats.  Fetch fails when the
>    callback is null.  Lets us distinguish "no stats yet" and "device
>    model can't do stats", thus permits a QMP interface that doesn't lie.
> 
> Right now, I like (2) better.
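> 
> A rough sketch of what (2) could look like (callback name and signature
> purely hypothetical, assuming a stats struct such as BlockAcctStats):
> 
>     struct BlockDevOps {
>         /* ... existing callbacks ... */
> 
>         /* Fill *stats with the device model's I/O accounting data.
>          * A null callback means "device model can't do stats". */
>         void (*collect_acct_stats)(void *opaque, BlockAcctStats *stats);
>     };
> 
> query-blockstats would then fail the fetch (or omit the fields) for devices
> that leave the callback null, instead of silently reporting zeroes.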

ok

> 
> >> > 4) Making this happen
> >> > -------------------------
> >> >
> >> > Outscale wants to make these I/O stats happen and gave me the go-ahead to
> >> > do whatever grunt work is required to do so.
> >> > That said, we could collaborate on some parts of the work.
> >> 
> >> Cool!
> >> 
> >> A quick stab at tasks:
> >> 
> >> * QMP interface, either a compatible extension of query-blockstats or a
> >>   new one.
> >
> > I would like to extend query-blockstats at first, but I also would like to
> > postpone the QMP interface changes and just write the shared infrastructure
> > and deploy it in the device models.
> 
> Implementing improved data collection need not wait for the QMP design.
> 
> >> * Rough idea on how to do the shared infrastructure.
> >
> > -API-wise I think about adding bdrv_acct_invalid() and bdrv_acct_failed(),
> > and systematically issuing bdrv_acct_start() as soon as possible.
> 
> Complication: partial success.  Example:
> 
> 1. Guest requests a read of N sectors.
> 
> 2. Device model calls
>    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
> 
> 3. Device model examines the request, and deems it valid.
> 
> 4. Device model passes it to the block layer.
> 
> 5. Block layer does its thing, but for some reason only M < N sectors
>    can be read.  Block layer returns M.
> 
> 6. What's the device model to do now?  Both bdrv_acct_failed() and
>    bdrv_acct_done() would be wrong.
> 
>    Should the device model account for a read of size M?  This ignores
>    the partial failure.
> 
>    Should it split the read into a successful and a failed part for
>    accounting purposes?  This miscounts the number of reads.

Maybe bdrv_acct_partial(), accounting the size of data actually read in the
bandwidth counter and taking care of counting this event.

Maybe we will discover some other rare event to account.
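
To make the proposal concrete, here is a rough sketch of the accounting API I
have in mind (the new signatures are tentative, not an implementation):

    /* existing */
    void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
                         int64_t bytes, enum BlockAcctType type);
    void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);

    /* proposed additions */

    /* guest request rejected by the device model before reaching the
     * block layer: bumps invalid_*_ios, does not touch bytes/time */
    void bdrv_acct_invalid(BlockDriverState *bs, BlockAcctCookie *cookie);

    /* request passed to the block layer but failed there:
     * bumps *_ios_error and accounts the elapsed time */
    void bdrv_acct_failed(BlockDriverState *bs, BlockAcctCookie *cookie);

    /* request partially completed: accounts the bytes actually
     * transferred and counts the event separately */
    void bdrv_acct_partial(BlockDriverState *bs, BlockAcctCookie *cookie,
                           int64_t done_bytes);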

> 
> > -To calculate the averages I think about a global timer firing every second
> > and iterating over the BDS list to make the computations even when there is
> > no I/O activity. Is it acceptable to have a QemuMutex per statistics structure?
> 
> Let me try to sketch something simpler.  For brevity, I cover only a
> single slice, but adding more is just more of the same.
> 
> Have an end-of-slice timestamp.
> 
> When accounting an operation, compare current time to end-of-slice (we
> already get the current time to measure latency, so this is cheap).  If
> end-of-slice is in the past, zero the slice's stats, and update
> end-of-slice.
> 
> When something asks for stats, do the same, to avoid returning stale
> slice stats.
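> 
> In code, the idea is roughly this (a sketch; struct and function names
> invented):
> 
>     typedef struct TimedStats {
>         int64_t slice_end_ns;       /* end of the current slice */
>         uint64_t ops;               /* operations accounted in this slice */
>         uint64_t total_latency_ns;  /* summed latency in this slice */
>     } TimedStats;
> 
>     /* called both when accounting an operation and when stats are queried,
>      * so no timer is needed and stale slices are never reported */
>     static void slice_roll_over(TimedStats *s, int64_t now, int64_t period_ns)
>     {
>         if (now >= s->slice_end_ns) {
>             /* slice is over: throw away its data and start a new one */
>             s->ops = 0;
>             s->total_latency_ns = 0;
>             s->slice_end_ns = now + period_ns;
>         }
>     }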
> 
> >> * Implement (can be split up into several tasks if desired)
> >
> > First I would like to write a series implementing a
> > backward-compatible API and get it
> > merged.
> >
> > Then the deployment of the new API specifics in the device models can be
> > split up and parallelized.
> 
> Works for me.
> 



