[linux-lvm] Discussion: performance issue on event activation mode

heming.zhao at suse.com
Thu Sep 30 15:32:15 UTC 2021


On 9/30/21 7:41 PM, Peter Rajnoha wrote:
> On 9/30/21 10:07, heming.zhao at suse.com wrote:
>> On 9/30/21 3:51 PM, Martin Wilck wrote:
>>> On Thu, 2021-09-30 at 00:06 +0200, Peter Rajnoha wrote:
>>>> On Tue 28 Sep 2021 12:42, Benjamin Marzinski wrote:
>>>>> On Tue, Sep 28, 2021 at 03:16:08PM +0000, Martin Wilck wrote:
>>>>>> I have pondered this quite a bit, but I can't say I have a concrete plan.
>>>>>>
>>>>>> To avoid depending on "udev settle", multipathd needs to partially revert to udev-independent device detection. At least during initial startup, we may encounter multipath maps with members that don't exist in the udev db, and we need to deal with this situation gracefully. We currently don't, and it's a tough problem to solve cleanly. Not relying on udev opens up a Pandora's box wrt WWID determination, for example. Any such change would without doubt carry a large risk of regressions in some scenarios, which we wouldn't want to happen in our large customer's data centers.
>>>>>
>>>>> I'm not actually sure that it's as bad as all that. We just may need a way for multipathd to detect if the coldplug has happened. I'm sure if we say we need it to remove the udev settle, we can get some method to check this. Perhaps there is one already, that I don't know about. If
>>>>
>>>> The coldplug events are synthesized and, as such, they all now contain a SYNTH_UUID=<UUID> key-value pair with kernel >= 4.13:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-uevent
>>>>
>>>> I've already tried to propose a patch for systemd/udev that would mark all uevents coming from the trigger (including the one used at boot for coldplug) with an extra key-value pair that we could easily match in rules, but that was not accepted. So right now, we can detect that a synthesized uevent happened, though we can't be sure it was the actual udev trigger at boot. For that, we'd need the extra marks. I can give it another try though; maybe if there are more people asking for this functionality, we'll be in a better position for this to be accepted.
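
For reference, such a synthesized uevent can already be recognized by a listener through the SYNTH_UUID property. Below is a minimal libudev sketch (only an illustration; the "block" subsystem filter and the printf reporting are my own choices, not anything multipathd or lvm does today):

/* Sketch: watch udev events and flag synthesized ones via the SYNTH_UUID
 * key-value pair that kernels >= 4.13 add to uevents triggered through
 * the sysfs "uevent" file.  Build: cc synth_watch.c -ludev -o synth_watch
 * The "block" filter is only an assumption for illustration. */
#include <stdio.h>
#include <poll.h>
#include <libudev.h>

int main(void)
{
    struct udev *udev = udev_new();
    struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
    struct pollfd pfd;

    udev_monitor_filter_add_match_subsystem_devtype(mon, "block", NULL);
    udev_monitor_enable_receiving(mon);
    pfd.fd = udev_monitor_get_fd(mon);
    pfd.events = POLLIN;

    for (;;) {
        if (poll(&pfd, 1, -1) <= 0)
            continue;
        struct udev_device *dev = udev_monitor_receive_device(mon);
        if (!dev)
            continue;

        /* SYNTH_UUID is present only for synthesized uevents; its value
         * is 0 when no UUID was passed in via the sysfs uevent file. */
        const char *synth = udev_device_get_property_value(dev, "SYNTH_UUID");
        printf("%s %s %s\n",
               udev_device_get_action(dev),
               udev_device_get_syspath(dev),
               synth ? "(synthesized)" : "(genuine)");
        udev_device_unref(dev);
    }
}
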
>>>
>>> That would allow us to discern synthetic events, but I'm unsure how this would help us. Here, what matters is to figure out when we don't expect any more of them to arrive.
>>>
>>> I guess it would be possible to compare the list of (interesting) devices in sysfs with the list of devices in the udev db. For multipathd, we could
>>>
>>>   - scan the set U of udev devices on startup
>>>   - scan the set S of sysfs devices on startup
>>>   - listen for uevents to update both S and U
>>>   - after each uevent, check if the difference set of S and U is empty
>>>   - if yes, coldplug has finished
>>>   - otherwise, continue waiting, possibly until some timeout expires.
>>>
>>> It's more difficult for LVM because you have no daemon maintaining state.
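
For illustration, a rough sketch of that S/U comparison with libudev could look like the following. This is only a sketch under my own assumptions: it restricts itself to the "block" subsystem, treats udev_device_get_is_initialized() as the "present in the udev db" test, and simply re-enumerates instead of maintaining the two sets from uevents as a real daemon would:

/* Sketch: decide whether coldplug has likely finished by checking that every
 * block device visible in sysfs is already initialized in the udev db.
 * Build: cc coldplug_check.c -ludev -o coldplug_check */
#include <stdio.h>
#include <libudev.h>

static int coldplug_settled(struct udev *udev)
{
    struct udev_enumerate *e = udev_enumerate_new(udev);
    struct udev_list_entry *entry;
    int missing = 0;

    udev_enumerate_add_match_subsystem(e, "block");
    udev_enumerate_scan_devices(e);   /* walks sysfs */

    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(e)) {
        const char *syspath = udev_list_entry_get_name(entry);
        struct udev_device *dev = udev_device_new_from_syspath(udev, syspath);

        /* Device exists in sysfs but udev has not (yet) processed it. */
        if (dev && !udev_device_get_is_initialized(dev))
            missing++;
        udev_device_unref(dev);
    }
    udev_enumerate_unref(e);
    return missing == 0;
}

int main(void)
{
    struct udev *udev = udev_new();
    printf("coldplug %s\n", coldplug_settled(udev) ? "settled" : "still pending");
    udev_unref(udev);
    return 0;
}
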
>>>
>>
>> Another performance story:
>> With legacy lvm2 (2.02.xx) and the lvmetad daemon, event-activation mode is very likely to time out with a large number of PVs.
>> When customers hit this issue, we suggested that they disable lvmetad.
> 
> We've already dumped lvmetad. Has this also been an issue with lvm versions without lvmetad, but still using the event-activation mode? (...the lvm versions where instead of lvmetad, we use the helper files under /run/lvm to track the state of incoming PVs and VG completeness)
> 
> Also, when I tried bootup with over 1000 devices in place (though in a VM, I don't have access to a real machine with so many devices), I noticed a performance regression in libudev itself, in the interface used to enumerate devices (which is the default obtain_device_list_from_udev=1 in lvm.conf):
> https://bugzilla.redhat.com/show_bug.cgi?id=1986158
> 
> It's very important to measure what exactly is causing the delays, and also how we measure it - I'm not that trusting of systemd-analyze blame, as it's not very clear what it is actually measuring.
> 
> I just want to say that some of the issues might simply be regressions/issues with systemd/udev that could be fixed. We, as providers of block device abstractions that sometimes need to handle thousands of devices, might be the first ones to hit these issues.
> 

The rhel8 callgrind picture (https://prajnoha.fedorapeople.org/bz1986158/rhel8_libudev_critical_cost.png)
matches my earlier analysis:
https://listman.redhat.com/archives/linux-lvm/2021-June/msg00022.html
handle_db_line took too much time and became the hotspot.
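
To get a feel for where that time goes on a given machine, a rough comparison of libudev's block-device enumeration (which ends up parsing the udev db entries, where handle_db_line shows up in the callgrind) against a plain listing of /sys/class/block can help. This is only a measurement sketch under my own assumptions (block subsystem only, wall-clock timing), not how lvm actually builds its device list:

/* Quick-and-dirty timing sketch: compare libudev block-device enumeration
 * (roughly what obtain_device_list_from_udev=1 relies on) with a plain
 * readdir() of /sys/class/block, which involves no udev db at all.
 * Build: cc enum_timing.c -ludev -o enum_timing */
#include <stdio.h>
#include <time.h>
#include <dirent.h>
#include <libudev.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double t0, t1;
    int n = 0;

    /* 1) libudev enumeration: scans sysfs and reads the udev db entries. */
    t0 = now();
    struct udev *udev = udev_new();
    struct udev_enumerate *e = udev_enumerate_new(udev);
    struct udev_list_entry *entry;
    udev_enumerate_add_match_subsystem(e, "block");
    udev_enumerate_scan_devices(e);
    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(e))
        n++;
    udev_enumerate_unref(e);
    udev_unref(udev);
    t1 = now();
    printf("libudev enumerate: %d devices in %.3f s\n", n, t1 - t0);

    /* 2) Raw sysfs listing, no udev db involved. */
    t0 = now();
    n = 0;
    DIR *d = opendir("/sys/class/block");
    struct dirent *de;
    while (d && (de = readdir(d)))
        if (de->d_name[0] != '.')
            n++;
    if (d)
        closedir(d);
    t1 = now();
    printf("/sys/class/block:  %d entries in %.3f s\n", n, t1 - t0);
    return 0;
}
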

Heming




