
[libvirt-users] Researching why different cache modes result in 'some' guest filesystem corruption..

Hi All,

I've been chasing down an issue in recent weeks (my own lab, so no prod here) and I'm reaching out in case someone might have some guidance to share.

I'm running fairly large VMs (RHOSP underclouds: 8 vCPU, 32 GB RAM, ~200 GB single disk as a growable qcow2) on some RHEL 7.6 hypervisors (kernel 3.10.0-927.2x.y, libvirt 4.5.0, qemu-kvm-1.5.3), on top of SSD/NVMe drives with various filesystems (vxfs, zfs, etc.) and using ECC RAM.

The issue can be described as follows:

- the guest VMs work fine for a while (days, weeks), but after a z-stream
  kernel update comes in, I am often greeted by the following message
  immediately after rebooting (or attempting to reboot into the new
  kernel):

"error: not a correct xfs inode"

- booting the previous kernel works fine and re-generating the initramfs
  for the new kernel (from the n-1 kernel) does not solve anything.

- if booted from an ISO, xfs_repair does not find errors.

- guests using ext4 instead of XFS seem to show some kind of corruption too.

I'm building the initial guest image qcow2 for those guest VMs this way:

1) start with a rhel-guest image (currently rhel-server-7.6-update-5-x86_64-kvm.qcow2)

2) convert to LVM by doing this:
 qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=1048576,lazy_refcounts=off final_guest.qcow2 512G
 virt-format -a final_guest.qcow2 --partition=mbr --lvm=/dev/rootdg/lv_root --filesystem=xfs
 guestfish --ro -a rhel_guest.qcow2 -m /dev/sda1 -- tar-out / - | \
 guestfish --rw -a final_guest.qcow2 -m /dev/rootdg/lv_root -- tar-in - /

3) use "final_guest.qcow2" as the basis for my guests with LVM.
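As a side note, after step 3 I can sanity-check the rebuilt image with the same tooling used above before handing it to guests (read-only inspections; the image name and LV path are the ones from step 2):

```shell
# Verify qcow2 metadata consistency (refcounts, leaked clusters)
qemu-img check final_guest.qcow2

# List the partitions, LVs and filesystems libguestfs sees inside the image,
# to confirm /dev/rootdg/lv_root exists and carries an xfs filesystem
virt-filesystems -a final_guest.qcow2 --all --long
```

Both commands leave the image untouched, so they are safe to run on the golden copy.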

After chasing this issue down some more and attempting various things (building the image on Fedora 29, building a real XFS filesystem inside a VM and using the generated qcow2 as a basis instead of virt-format)...

...I noticed that the SATA disk of each of those guests was using the 'directsync' cache mode (instead of 'Hypervisor Default'). As soon as I switched to 'none', the XFS issues disappeared, and I've since applied several consecutive kernel updates without issue. Both 'directsync' and 'writethrough', while providing decent performance, exhibited the XFS 'corruption' behaviour; only 'none' seems to have solved it.

I've read the docs, and I thought it was OK to use those modes given my setup (UPS, battery-backed RAID, etc.).
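For reference, the working configuration corresponds to a disk <driver> element like the following in the domain XML (the source path and target device here are illustrative, not my actual values):

```xml
<disk type='file' device='disk'>
  <!-- cache='none' bypasses the host page cache (O_DIRECT on the host side)
       while still honouring guest flush requests down to the backing store -->
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/libvirt/images/final_guest.qcow2'/>
  <target dev='sda' bus='sata'/>
</disk>
```

With 'directsync' or 'writethrough' the same element carried cache='directsync' / cache='writethrough' respectively; that attribute is the only thing I changed.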

Does anyone have any idea what's going on or what I may be doing wrong?

Thanks for reading,

