[linux-lvm] lvconvert --uncache takes hours
rogerheflin at gmail.com
Thu Mar 2 11:27:32 UTC 2023
On Thu, Mar 2, 2023 at 2:34 AM Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> ----- Original Message -----
> > From: "Roger Heflin" <rogerheflin at gmail.com>
> > To: "linux-lvm" <linux-lvm at redhat.com>
> > Cc: "Malin Bruland" <malin.bruland at pm.me>
> > Sent: Thursday, 2 March, 2023 01:51:08
> > Subject: Re: [linux-lvm] lvconvert --uncache takes hours
> > On Wed, Mar 1, 2023 at 4:50 PM Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> >> Hi all
> >> Working with a friend's machine, it has lvmcache turned on with writeback. This
> >> has worked well, but now it's uncaching and it takes *hours*. The amount of
> >> cache was chosen to 100GB on an SSD not used for much else and the dataset that
> >> is being cached, is a RAID-6 set of 10x2TB with XFS on top. The system mainly
> >> works with file serving, but also has some VMs that benefit from the caching
> >> quite a bit. But then - I wonder - how can it spend hours emptying the cache
> >> like this? Most write caching I know of last only seconds or perhaps in really
> >> worst case scenarios, minutes. Since this is taking hours, it looks to me
> >> something should have been flushed ages ago.
> >> Have I (or we) done something very stupid here or is this really how it's
> >> supposed to work?
> >> Kind regards
> >> roy
> > A spinning raid6 array is slow on writes (see raid6 write penalty).
> > Because of that the array can only do about 100 write operations/sec.
> About 100 writes/second per data drive, that is. md parallelises I/O well.
No. On writes you get about 100 writes/sec to the raid6 total. With
reads you get 100 iops/disk. The writes, by their very raid6 nature,
cannot be spread out that way.
Each write to md requires a lot of work. At minimum, you have to
re-read the sector you are writing, read the parity you need to
update, calculate the parity changes, and re-write the data plus any
parities that changed. Your other option is to write an entire
stripe, but that requires writes to all disks plus the parity
calculation plus writes to the parity disks. All ways of writing data
to raid5/6 break down to iops/disk == total write iops.
The raid5/6 format requires these multiple reads and writes, and that
is what makes it slow on writes.
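A rough sketch of that arithmetic. The 6-I/O cost per sub-stripe write
is the standard raid6 read-modify-write penalty; the 10 spindles come
from this thread, and the ~100 iops/disk figure is an assumption typical
of 7200rpm spinning disks:

```python
# Back-of-the-envelope raid6 small-write penalty.
# Assumption: every destaged write is smaller than a full stripe, so md
# must do a read-modify-write of the data sector plus both parities.

DISKS = 10            # spindles in the raid6 set (from the thread)
IOPS_PER_DISK = 100   # assumed seek rate for 7200rpm spinning disks

# One sub-stripe write costs: read old data, read old P, read old Q,
# then write new data, new P, new Q -> 6 device I/Os per logical write.
IOS_PER_SMALL_WRITE = 6

device_iops_total = DISKS * IOPS_PER_DISK
small_write_iops = device_iops_total / IOS_PER_SMALL_WRITE
print(f"~{small_write_iops:.0f} small writes/sec across the whole array")
```

That lands in the same low-hundreds range as the "about 100 writes/sec
total" figure above, versus roughly 1000 iops the same disks could do
on reads.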
> > If the disk is doing other work then it only has the extra capacity so
> > it could destage slower.
> The system was mostly idle.
> > A lot depends on how big each chunk is. The lvmcache indicates the
> > smallest chunksize is 32k.
> > 100G / 32k = 3 million, and at 100 seeks/sec that comes to at least an hour.
> Those 100GB was on SSD, not spinning rust. Last I checked, that was the whole point with caching.
You are de-staging the SSD cache to the spinning disks, correct? The
writes to the spinning disks are slow.
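A worst-case back-of-the-envelope for the destage time, using the 100GB
cache and 32k minimum chunk size from this thread. The assumption (mine,
not stated in the thread) is that every chunk is dirty and each flush
costs one seek-bound random write to the array; fewer dirty chunks, or
chunks that merge into sequential writes, would shorten this a lot:

```python
# Worst-case destage time for a writeback lvmcache onto a slow raid6.
# Assumption: all chunks dirty, one seek-bound write per chunk flushed.

CACHE_BYTES = 100 * 2**30   # 100 GiB lvmcache (from the thread)
CHUNK_BYTES = 32 * 2**10    # 32k minimum lvmcache chunk size
ARRAY_WRITE_IOPS = 100      # seek-bound writes/sec the raid6 can absorb

chunks = CACHE_BYTES // CHUNK_BYTES
seconds = chunks / ARRAY_WRITE_IOPS
print(f"{chunks} chunks -> ~{seconds / 3600:.1f} hours if all dirty")
```

So even on an otherwise idle array the flush can plausibly run for
hours, which matches what is being observed.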
> > Lvm bookkeeping has to also be written to the spinning disks I would
> > think, so 2 hours if the array were idle.
> erm - why on earth would you do writes to hdd if you're caching it?
Once the cache is gone all LVM should be on the spinning disks.
> > Throw in a 50% baseload on the disks and you get 4 hours.
> > Hours is reasonable.
> As I said, the system was idle.
> Kind regards