[linux-lvm] Discussion: performance issue on event activation mode

Mon Jun 7 16:40:01 UTC 2021

Hi Heming,

Thanks for the analysis and tying things together for us so clearly, and I
like the ideas you've outlined.

On Sun, Jun 06, 2021 at 02:15:23PM +0800, heming.zhao at suse.com wrote:
> I send this mail for a well known performance issue:
>  when system is attached huge numbers of devices. (ie. 1000+ disks),
>  the lvm2-pvscan@.service costs too much time and systemd is very easy to
>  time out, and enter emergency shell in the end.
> 
> This performance topic had been discussed in there some times, and the issue was
> lasting for many years. From the lvm2 latest code, this issue still can't be fix
> completely. The latest code add new function _pvscan_aa_quick(), which makes the
> booting time largely reduce but still can't fix this issue utterly.
> 
> In my test env, x86 qemu-kvm machine, 6vcpu, 22GB mem, 1015 pv/vg/lv, comparing
> with/without _pvscan_aa_quick() code, booting time reduce from "9min 51s" to
> "2min 6s". But after switching to direct activation, the booting time is 8.7s
> (for longest lvm2 services: lvm2-activation-early.service).

Interesting, it's good to see the "quick" optimization is so effective.
Another optimization that should be helping in many cases is the
"vgs_online" file which will prevent concurrent pvscans from all
attempting to autoactivate a VG.
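
To illustrate the idea only (this is not the lvm2 code), here is a
minimal sketch in which an exclusive file create lets just one of several
racing pvscans claim a VG for autoactivation.  The function name and the
directory passed in are hypothetical; the real on-disk layout is not
shown here.

```python
import os
import tempfile


def claim_vg(online_dir, vg_name):
    """Return True only for the first caller to claim vg_name."""
    path = os.path.join(online_dir, vg_name)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly
        # one process can create it even when several race.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False  # someone else is already handling this VG
    os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(claim_vg(d, "vg0"))
        print(claim_vg(d, "vg0"))
```

The caller that gets True would go on to activate the VG; every other
caller can exit without repeating the work.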

> The hot spot of event activation is dev_cache_scan, which time complexity is
> O(n^2). And at the same time, systemd-udev worker will generate/run
> lvm2-pvscan@.service on all detecting disks. So the overall is O(n^3).
> 
> ```
> dev_cache_scan //order: O(n^2)
>  + _insert_dirs //O(n)
>  | if obtain_device_list_from_udev() true
>  |   _insert_udev_dir //O(n)
>  |
>  + dev_cache_index_devs //O(n)
> 
> There are 'n' lvm2-pvscan@.service running: O(n)
> Overall: O(n) * O(n^2) => O(n^3)
> ```

I knew the dev_cache_scan was inefficient, but didn't realize it was
having such a negative impact, especially since it isn't reading devices.
Some details I'm interested to look at more closely (and perhaps you
already have some answers here):

1. Does obtain_device_list_from_udev=0 improve things?  I recently noticed
that 0 appeared to be faster (anecdotally), and proposed we change the
default to 0 (also because I'm biased toward avoiding udev whenever
possible.)

2. We should probably move or improve the "index_devs" step; it's not the
main job of dev_cache_scan and I suspect this could be done more
efficiently, or avoided in many cases.

3. pvscan --cache is supposed to be scalable because it only (usually)
reads the single device that is passed to it, until activation is needed,
at which point all devices are read to perform a proper VG activation.
However, pvscan does not attempt to reduce dev_cache_scan since I didn't
know it was a problem.  It probably makes sense to avoid a full
dev_cache_scan when pvscan is only processing one device (e.g.
setup_device() rather than setup_devices().)

> Question/topic:
> Could we find out a final solution to have a good performance & scale well under
> event-based activation?

First, you might not have seen my recently added udev rule for
autoactivation.  I apologize that it's been sitting in the "dev-next"
branch, since we've not figured out a good branching strategy for this
change.  We just began getting some feedback on this change last week:

https://sourceware.org/git/?p=lvm2.git;a=blob;f=udev/69-dm-lvm.rules.in;h=03c8fbbd6870bbd925c123d66b40ac135b295574;hb=refs/heads/dev-next

There's a similar change I'm working on for dracut:
https://github.com/dracutdevs/dracut/pull/1506

Each device uevent still triggers a pvscan --cache, reading just the one
device, but when a VG is complete, the udev rule runs systemd-run vgchange
-aay VG.  Since it's not changing dev_cache_scan usage, the issues you're
describing will still need to be looked at.
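
As a rough sketch of the "activate once the VG is complete" logic (an
illustration under assumptions, not the rule or pvscan code): each event
records one PV as online, and only the event that completes the VG
reports that activation should be started.  The in-memory dicts and the
function name are hypothetical stand-ins for whatever state pvscan keeps.

```python
def record_pv_online(online, vg_pvs, pv, vg):
    """Mark pv online; return True only on the event that completes vg.

    online: dict mapping VG name -> set of PVs seen so far (mutated here)
    vg_pvs: dict mapping VG name -> set of all PVs the VG needs
    """
    seen = online.setdefault(vg, set())
    if pv in seen:
        return False  # duplicate event, nothing new to do
    seen.add(pv)
    # True exactly once: when the last missing PV arrives.
    return seen == vg_pvs[vg]


if __name__ == "__main__":
    online = {}
    vg_pvs = {"vg0": {"sda", "sdb"}}
    for dev in ("sda", "sdb"):
        if record_pv_online(online, vg_pvs, dev, "vg0"):
            print("would run: vgchange -aay vg0")
```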

> Maybe two solutions (Martin & I discussed):
> 
> 1. During boot phase, lvm2 automatically switches to direct activation mode
> ("event_activation = 0"). After booted, switch back to the event activation mode.
> 
> Booting phase is a special stage. *During boot*, we could "pretend" that direct
> activation (event_activation=0) is set, and rely on lvm2-activation-*.service
> for PV detection. Once lvm2-activation-net.service has finished, we could
> "switch on" event activation.
> 
> More precisely: pvscan --cache would look at some file under /run,
> e.g. /run/lvm2/boot-finished, and quit immediately if the file doesn't exist
> (as if event_activation=0 was set). In lvm2-activation-net.service, we would add
> something like:
> 
> ```
> ExecStartPost=/bin/touch /run/lvm2/boot-finished
> ```
> 
> ... so that, from this point in time onward, "pvscan --cache" would _not_ quit
> immediately any more, but run normally (assuming that the global
> event_activation setting is 1). This way we'd get the benefit of using the
> static activation services during boot (good performance) while still being able
> to react to udev events after booting has finished.
> 
> This idea would be worked out with very few code changes.
> The result would be a huge step forward on booting time.

This sounds appealing to me.  I've always found it somewhat dubious how we
pretend each device is newly attached, and process it individually, even
if all devices are already present.  We should be taking advantage of the
common case when many or most devices are already present, which is what
you're doing here.  Async/event-based processing has its place, but it's
surely not always the best answer.  I will think some more about the
details of how this might work; it seems promising.
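
The gate described in the quoted proposal could look roughly like the
sketch below.  It is only an illustration: the function name is
hypothetical, and the flag path is the example from the proposal, not an
existing lvm2 file.

```python
import os
import tempfile


def should_process_event(event_activation, flag_path):
    """Decide whether a pvscan-style uevent handler should do any work."""
    if not event_activation:
        return False  # event activation is disabled globally
    # Until the flag exists we are still booting: leave activation to the
    # static lvm2-activation-*.service units and exit early.
    return os.path.exists(flag_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Stand-in for the proposal's /run/lvm2/boot-finished.
        flag = os.path.join(d, "boot-finished")
        print(should_process_event(True, flag))   # flag not created yet
        open(flag, "w").close()                   # what ExecStartPost would do
        print(should_process_event(True, flag))
```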

> 2. change lvm2-pvscan@.service running mode from parallel to serial.
> 
> This idea looks a little weird, it goes the opposite trend of today's
> programming technologies: parallel programming on multi-cores.
> 
> idea:
> the action of lvm2 scanning "/dev" is hard to change, the outside parallel
> lvm2-pvscan@.service could change from parallel to serial.
>
> For example, a running pvscan instance could set a "running" flag in tmpfs (ie.
> /run/lvm/) indicating that no other pvscan process should be called in parallel.
> If another pvscan is invoked and sees "running", it would create a "pending"
> flag, and quit. Any other pvscan process seeing the "pending" flag would
> just quit. If the first instance sees the "pending" flag, it would
> atomically remove "pending" and restart itself, in order to catch any device
> that might have appeared since the previous sysfs scan.
> In most condition, devices had been found by once pvscan scanning,
> then next time of pvscan scanning should work with order O(n), because the
> target device had been inserted internal cache tree already. and on overall,
> there is only a single pvscan process would be running at any given time.
> 
> We could create a list of pending to-be-scanned devices then (might be directory
> entries in some tmpfs directory). On exit, pvscan could check this dir and
> restart if it's non-empty.

The present design is based on pvscan --cache reading only the one device
that has been attached, and I think that's good.  I'd expect that also
lends itself to running pvscans in parallel, since they are all reading
different devices.  If it's just dev_cache_scan that needs optimizing, I
expect there are better ways to do that than adding serialization.  This
is also related to the number of udev workers as mentioned in the next
email.  So I think we need to narrow down the problem a little more before
we know if serializing is going to be the right answer, or where/how to do
it.
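
For reference while we narrow that down, the "running"/"pending" scheme
from the quoted proposal can be sketched as below.  This is a toy model
under assumptions, not a proposed patch: the helper names and the flag
directory are hypothetical, and scan() stands in for the actual device
scan.

```python
import os
import tempfile


def _create_excl(path):
    """Create path exclusively; return True if this caller created it."""
    try:
        os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600))
        return True
    except FileExistsError:
        return False


def _remove(path):
    """Remove path; return True if this caller removed it."""
    try:
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False


def run_serialized(flag_dir, scan):
    """Run scan() unless another instance is running; return scans done."""
    running = os.path.join(flag_dir, "running")
    pending = os.path.join(flag_dir, "pending")
    if not _create_excl(running):
        _create_excl(pending)  # ask the running instance to rescan
        return 0
    count = 0
    try:
        while True:
            scan()
            count += 1
            # Rescan only if someone arrived while we were scanning.
            if not _remove(pending):
                break
    finally:
        _remove(running)
    return count
```

Note that this sketch leaves a window between the last "pending" check
and the removal of "running" in which a late arrival would not be
rescanned; a real design would need to close that gap.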

Dave