[Libguestfs] LVM flakey failures
Richard W.M. Jones
rjones at redhat.com
Mon Apr 5 14:18:01 UTC 2021
On Mon, Apr 05, 2021 at 04:47:30PM +0300, Sam Eiderman wrote:
> We also looked at udev settle call points in the logs and it seems that it is
> called a lot of times before.
>
> The bug I mentioned is
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689 and they
> also discuss that maybe udev settle is not working as intended.
So I don't know, but it should be relatively easy to tell. Firstly
you can modify appliance/init to add very verbose debugging to udev.
Uncomment --debug here:
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/appliance/init#L107
Additionally, or instead, you could modify daemon/utils.c to run "ls -lR
/dev/" before and after the udevadm settle command, which should show
whether the additional device nodes are present before and/or after the
settle command. That would be a pretty good way to tell if udevadm
settle is having the effect we think it should.
> The kernel version of the appliance (as can be seen in the log) is 4.19
>
> > Collecting the full logs is the right approach to diagnosing this.
>
> I added the full log for the first failure; I think we can see from there
> that udev settle is called but the file does not exist yet.
Do you have the full logs from the second case?
> We thought that maybe if we explicitly add the following logic right after
> g.launch() it might help:
>
> 1. For each device returned by: lvm 'lvs' '-o' 'vg_name,lv_name' '-S'
>    'lv_role=public && lv_skip_activation!=yes' '--noheadings' '--separator' '/'
> 1.1. stat the device /dev/vg/lv
> 1.2. if stat fails because the device does not exist, wait
> 1.3. Go back to 1
>
> If we wait for too long, relaunch guestfs.
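
For reference, the polling logic proposed above could be sketched
roughly as follows in Python. This is only an illustration: list_lvs
and node_exists are hypothetical callables standing in for however the
caller gets the "vg/lv" names and checks for the device node. They are
not libguestfs APIs.

```python
import time


def wait_for_lv_nodes(list_lvs, node_exists, timeout=10.0, interval=0.2,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until every LV named by list_lvs() has a node under /dev.

    list_lvs:    hypothetical callable returning names such as "rhel/swap"
    node_exists: hypothetical callable taking a path, returning True/False
    Returns True once all nodes exist, False if the timeout expires
    (at which point the caller could relaunch the handle).
    """
    deadline = clock() + timeout
    while True:
        missing = [lv for lv in list_lvs()
                   if not node_exists("/dev/" + lv)]
        if not missing:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The sleep and clock parameters are only there so the loop can be
exercised without real delays.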
It'd be a bit of a hack. Probably better to try to work out what's
going wrong first of all. It should be possible to tell from the
kernel, udev and libguestfs logs.
Rich.
> However it would be nicer to maybe implement this inside guestfs.launch()
> itself
>
> Sam
>
>
> On Mon, Apr 5, 2021 at 3:45 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> On Mon, Apr 05, 2021 at 02:47:51PM +0300, Sam Eiderman wrote:
> > Hi,
> >
> > We are experiencing very rare LVM failures - 2 failures so far, in
> > different OSs, in different libguestfs functions.
> >
> > The first failure is inspect_os() not finding the root operating
> > system on rhel7.4.
> > LVM volumes are returned by the lvm command but files under /dev do not
> > exist (yet?)
> >
> > Second failure is in is_lv() - is_lv() successfully enumerates all lvm
> > volumes but then internal stat() command fails again on /dev file
> > since it does not exist (yet?) (rhel8.0)
> >
> > All of our tests run in parallel, 1 guestfs instance per core on a 32
> > core machine and they run on GCP (nested virtualization).
> >
> > What we think is happening here is that the libguestfs appliance is
> > booting somewhat slower than usual and that the links to some
> > devices do not appear yet (even after multiple seconds).
> > We found this old issue that might be connected to this behavior (in
> > some way): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689
>
> I wonder if "udevadm settle" is not working? The daemon will use this
> command at various times in order to ensure that all preceding udev
> messages have been processed and all /dev changes have been made.
>
> It is called once at appliance boot:
>
> https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/appliance/init#L109
>
> And throughout the daemon code:
>
> https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/daemon/utils.c#L732
>
> $ git grep 'udev_settle ()' -- daemon
> daemon/blockdev.c: udev_settle ();
> daemon/cryptsetup.ml: udev_settle ()
> daemon/cryptsetup.ml: udev_settle ()
> daemon/file.c: udev_settle ();
> daemon/guestfsd.c: udev_settle ();
> daemon/hotplug.c: udev_settle ();
> [etc etc]
>
> It could be that udev_settle is not being called at the right points,
> or is not working in the way we understand.
>
> ...
> > Short second failure logs (is_lv() only). Notice that is_lv() is
> > invoked on /dev/vg_myvg/lv_var but it fails because
> > /dev/rhel/swap does not exist:
> >
> > 2021-03-07 10:58:53 T libguestfs - 0 - enter - is_lv
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: =>
> > aug_get (0x13) took 0.00 secs
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: <= is_lv
> > (0x108) request length 64 bytes
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf:
> > stdout=n stderr=y flags=0x0
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf: udevadm
> > --debug settle -E /dev/vg_myvg/lv_var
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm 'lvs'
> > '-o' 'vg_name,lv_name' '-S' 'lv_role=public &&
> > lv_skip_activation!=yes' '--noheadings' '--separator' '/'
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm returned 0
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm: stdout:
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/root
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/swap
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - vg_myvg/lv_var
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: error:
> > stat: /dev/rhel/swap: No such file or directory
>
> You might want to look earlier in this log to see if udevadm settle
> was called between the LVs being activated and this API function. If
> it was not being called then possibly we need to insert a call after
> activation. If it was being called then perhaps udev settle is not
> working the way we understand it.
>
> Collecting the full logs is the right approach to diagnosing this.
>
> The only other issue I can think of is the change in kernel PCI device
> enumeration code (starting in Linux 5.6,
> https://bugzilla.redhat.com/show_bug.cgi?id=1804207). I suppose in
> theory the underlying devices might not be ready at all before we run
> udev settle in the appliance. However I have not seen this actually
> happen.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines. Supports shell scripting,
> bindings from many languages. http://libguestfs.org
>
>
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/