
Re: [Libguestfs] LVM flakey failures



On Mon, Apr 05, 2021 at 04:47:30PM +0300, Sam Eiderman wrote:
> We also looked at udev settle call points in the logs and it seems that it is
> called a lot of times before.
>
> The bug I mentioned is
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689 and they
> also discuss that maybe udev settle is not working as intended.

So I don't know, but it should be relatively easy to tell.  Firstly
you can modify appliance/init to add very verbose debugging to udev.
Uncomment --debug here:

https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/appliance/init#L107

Additionally or instead you could modify daemon/utils.c to do “ls -lR
/dev/” before and after the udevadm settle command, which should show
if the additional device nodes are present before and/or after the
settle command.  That would be a pretty good way to tell if udevadm
settle is having the effect we think it should.
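
To make the before/after comparison concrete, here is a rough sketch in
Python. It only illustrates the idea: the real change would be a few lines of
C in daemon/utils.c, and the helper names snapshot_dev and nodes_added_by are
made up for this example.

```python
import os


def snapshot_dev(root="/dev"):
    """Return the set of every path under root (files, dirs, symlinks)."""
    paths = set()
    # followlinks=False: record symlinks such as /dev/VG/LV without
    # descending into whatever they point at.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            paths.add(os.path.join(dirpath, name))
    return paths


def nodes_added_by(action, root="/dev"):
    """Run action() and return the sorted paths that appeared under root."""
    before = snapshot_dev(root)
    action()
    after = snapshot_dev(root)
    return sorted(after - before)
```

Passing something like
lambda: subprocess.run(["udevadm", "settle"], check=True) as the action would
list the nodes that only showed up once settle returned.  An empty result
while the LV nodes are still missing would suggest settle is returning before
the events we care about have been processed.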

> The kernel version of the appliance (as can be seen in the log) is 4.19
> 
> > Collecting the full logs is the right approach to diagnosing this.
> 
> I added the full log for the first failure logs, I think we can see from there
> that udev settle is called but the file does not exist yet.

Do you have the full logs from the second case?

> We thought that maybe if we explicitly add the following logic right after
> g.launch() it might help:
>
> 1. For each device returned by: lvm 'lvs' '-o' 'vg_name,lv_name' '-S'
> 'lv_role=public && lv_skip_activation!=yes' '--noheadings' '--separator' '/'
> 1.1. stat the device /dev/vg/lv
> 1.2. if stat fails because the device does not exist - wait
> 1.3. Go back to 1
> 
> If we wait for too long, relaunch guestfs.

It'd be a bit of a hack.  Probably better to try to work out what's
going wrong first of all.  It should be possible to tell from the
kernel, udev and libguestfs logs.
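
For the record, the polling workaround described above might look roughly
like the following.  This is only a sketch under assumptions: list_lvs and
device_exists are hypothetical callables that you would have to wire up to
the handle yourself (they are not libguestfs API names), and list_lvs is
assumed to return "vg/lv" strings as in the lvs output quoted above.

```python
import time


def wait_for_lv_devices(list_lvs, device_exists, timeout=10.0, interval=0.5,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until every LV reported by list_lvs() has a /dev node.

    list_lvs      -- hypothetical callable returning names like "rhel/swap"
    device_exists -- hypothetical callable: device_exists("/dev/rhel/swap")
    Returns True once all nodes exist, False if the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        # Re-list on every pass (step 1.3, "go back to 1").
        missing = [lv for lv in list_lvs()
                   if not device_exists("/dev/" + lv)]
        if not missing:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A False return is the "waited too long" case, where the caller would
relaunch the handle.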

Rich.

> However it would be nicer to maybe implement this inside guestfs.launch()
> itself
> 
> Sam
> 
> 
> On Mon, Apr 5, 2021 at 3:45 PM Richard W.M. Jones <rjones redhat com> wrote:
> 
>     On Mon, Apr 05, 2021 at 02:47:51PM +0300, Sam Eiderman wrote:
>     > Hi,
>     >
>     > We are experiencing very rare LVM failures - 2 failures so far, in
>     > different OSs, in different libguestfs functions.
>     >
>     > The first failure is inspect_os() not finding the root operating
>     > system on rhel7.4.
>     > LVM volumes are returned by lvm command but files under /dev do not
>     > exist (yet?)
>     >
>     > Second failure is in is_lv() - is_lv() successfully enumerates all lvm
>     > volumes but then internal stat() command fails again on /dev file
>     > since it does not exist (yet?) (rhel8.0)
>     >
>     > All of our tests run in parallel, 1 guestfs instance per core on a 32
>     > core machine and they run on GCP (nested virtualization).
>     >
>     > What we think is happening here is that libguestfs' appliance is
>     > booting somewhat slower than usual and that the links to some
>     > devices do not appear yet (even after multiple seconds).
>     > We found this old issue that might be connected to this behavior (in
>     > some way): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689
> 
>     I wonder if "udevadm settle" is not working?  The daemon will use this
>     command at various times in order to ensure that all preceding udev
>     messages have been processed and all /dev changes have been made.
> 
>     It is called once at appliance boot:
> 
>       https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/appliance/init#L109
> 
>     And throughout the daemon code:
> 
>       https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/daemon/utils.c#L732
> 
>       $ git grep 'udev_settle ()' -- daemon
>       daemon/blockdev.c:  udev_settle ();
>       daemon/cryptsetup.ml:  udev_settle ()
>       daemon/cryptsetup.ml:  udev_settle ()
>       daemon/file.c:    udev_settle ();
>       daemon/guestfsd.c:  udev_settle ();
>       daemon/hotplug.c:    udev_settle ();
>       [etc etc]
> 
>     It could be that udev_settle is not being called at the right points,
>     or is not working in the way we understand.
> 
>     ...
>     > Short second failure logs (is_lv() only) - notice that is_lv() is
>     > invoked on /dev/vg_myvg/lv_var but it fails due to a problem in
>     > /dev/rhel/swap not existing)
>     >
>     > 2021-03-07 10:58:53 T libguestfs - 0 - enter - is_lv
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: =>
>     > aug_get (0x13) took 0.00 secs
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: <= is_lv
>     > (0x108) request length 64 bytes
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf:
>     > stdout=n stderr=y flags=0x0
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf: udevadm
>     > --debug settle -E /dev/vg_myvg/lv_var
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm 'lvs'
>     > '-o' 'vg_name,lv_name' '-S' 'lv_role=public &&
>     > lv_skip_activation!=yes' '--noheadings' '--separator' '/'
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm
>     > returned 0
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm: stdout:
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/root
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/swap
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - vg_myvg/lv_var
>     > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: error:
>     > stat: /dev/rhel/swap: No such file or directory
> 
>     You might want to look earlier in this log to see if udevadm settle
>     was called between the LVs being activated and this API function.  If
>     it was not being called then possibly we need to insert a call after
>     activation.  If it was being called then perhaps udev settle is not
>     working the way we understand it.
> 
>     Collecting the full logs is the right approach to diagnosing this.
> 
>     The only other issue I can think of is the change in kernel PCI device
>     enumeration code (starting in Linux 5.6,
>     https://bugzilla.redhat.com/show_bug.cgi?id=1804207).  I suppose in
>     theory the underlying devices might not be ready at all before we run
>     udev settle in the appliance.  However I have not seen this actually
>     happen.
> 
>     Rich.
> 
>     --
>     Richard Jones, Virtualization Group, Red Hat
>     http://people.redhat.com/~rjones
>     Read my programming and virtualization blog: http://rwmj.wordpress.com
>     libguestfs lets you edit virtual machines.  Supports shell scripting,
>     bindings from many languages.  http://libguestfs.org
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

