[linux-lvm] intermittent transaction ID mismatch with rapid use of thin snapshots

Tue May 23 19:44:42 UTC 2023

Hi, I am seeing occasional, hard-to-reproduce failures when running lvcreate
to create a thin LV, with transaction ID mismatch errors.

The system is a self-managed compute node that uses thin LVs for base system
software and containers: each service has a separate thin LV as its rootfs,
and the system takes a fresh thin snapshot of the installed contents at every
boot.

During system bring-up we have two concurrent processes adding, deleting, and
renaming thin LVs from a single thin pool:

1 - a login script that creates a thin snapshot of a minimal rootfs for each
user, then launches an LXC container with that rootfs, and leaves the user in
bash running in that container. If anything fails at any point in that
process, it will lvremove the snapshot and retry several times. Although it
creates a container, the script itself runs as root via sudo, not inside any
container/namespaces.
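
The create/launch/cleanup/retry behaviour of the login script can be sketched
roughly as follows. This is an illustrative Python sketch, not the script
itself: the function names, the `launch` callback, the default of three
attempts, and the exact lvcreate/lvremove arguments are all assumptions made
for the example.

```python
import subprocess
from typing import Callable, List


def run(cmd: List[str]) -> None:
    """Run one LVM command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


def login_with_retry(vg: str, base_lv: str, snap_name: str,
                     launch: Callable[[str], None],
                     attempts: int = 3,
                     runner: Callable[[List[str]], None] = run) -> bool:
    """Snapshot base_lv, hand the snapshot to launch(), retry on failure.

    `launch` stands in for "start the LXC container and drop the user into
    bash" (hypothetical callback). Returns True once an attempt succeeds.
    """
    for _ in range(attempts):
        created = False
        try:
            runner(["lvm", "lvcreate", f"--name={snap_name}",
                    "--snapshot", f"{vg}/{base_lv}"])
            created = True
            launch(snap_name)
            return True
        except Exception:
            # Assumption: only remove the snapshot if lvcreate got far
            # enough to create it; a failing lvremove propagates.
            if created:
                runner(["lvm", "lvremove", "--yes", f"{vg}/{snap_name}"])
    return False
```

Each failed attempt therefore costs one lvcreate plus one lvremove against the
pool, which is where the high command rate during recovery loops comes from.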

2 - a software install service that takes system services packaged as
containers and creates thin LVs based on the container image layer set, and
then takes an additional thin snapshot to be mounted for the current boot.
This last snapshot is multi-step, with an initial lvcreate of a temp name and
a final lvrename.
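
The two-step snapshot in item 2 can be sketched like this. It is a minimal
Python illustration under stated assumptions: the function and parameter names
are hypothetical, the lvcreate flags are the ones from the failing command in
the log further down, and whatever work happens between the two steps is
omitted.

```python
import subprocess
from typing import Callable, List


def run(cmd: List[str]) -> None:
    """Run one LVM command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


def snapshot_then_rename(vg: str, origin: str, tmp_name: str, final_name: str,
                         runner: Callable[[List[str]], None] = run) -> None:
    """Create a thin snapshot under a temporary name, then rename it."""
    # Step 1: thin snapshot of the origin LV under a temporary name.
    runner(["lvm", "lvcreate",
            "--activate=y", "--setactivationskip=y", "--ignoreactivationskip",
            f"--name={tmp_name}", "--snapshot", f"{vg}/{origin}"])
    # (Any work done on the temporary LV would go here; not shown.)
    # Step 2: give the snapshot its final name.
    runner(["lvm", "lvrename", vg, tmp_name, final_name])
```

So each install-service snapshot is at least two separate metadata-changing
LVM commands, and it is the first one (the lvcreate) that fails.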

Neither of these processes is in a container or on a VM with a shared volume,
so they should be seeing the same LVM lock files, as far as I can tell.

This overall approach has been stable for a long time, but a recent change has
caused these to overlap more frequently, and we are now seeing failures in
lvcreate with a transaction id mismatch when the install service tries to
create its temporary LV. Here's a snippet from one such log:

Error: lvm lvcreate --activate=y --setactivationskip=y
--ignoreactivationskip
--name=tmp-extract-414e5ec83c02133eae2984ee425b22589bca058d --snapshot
vg_ifc0/5e9a280e11efbc75ed8f01bdd7b58559c373b451b72921555be5c2eaf93d27b2:
exit status 5:   /dev/sdh: open failed: No medium found
/dev/sdi: open failed: No medium found
/dev/sdj:ound
/dev/sdk: open failed: No medium found
/dev/sdh: open failed: No medium found
/dev/sdi: open failed: No medium found
/dev/sdj: open failed: No medium found
/dev/sdk: open failed: No medium fou
ThinDataLV-tpool (251:3) transaction_id is 147, while expected 148.
Failed to suspend vg_ifc0/ThinDataLV with queued messages.
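
For correlating failures across the many retries, the mismatch line can be
pulled out of captured lvcreate output with a small parser. This is only a
sketch: the regular expression is derived from the single message shown above,
and the function name and returned keys are hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Pattern based on the one message seen in the log; other LVM versions may
# word it differently (assumption).
MISMATCH_RE = re.compile(
    r"(?P<device>\S+) \((?P<major>\d+):(?P<minor>\d+)\) "
    r"transaction_id is (?P<actual>\d+), while expected (?P<expected>\d+)\."
)


def parse_mismatch(output: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the reported and expected transaction IDs, or None if absent."""
    m = MISMATCH_RE.search(output)
    if m is None:
        return None
    return {
        "device": m.group("device"),
        "actual": int(m.group("actual")),
        "expected": int(m.group("expected")),
    }
```

Logging these two numbers with a timestamp for every failed command would
show whether the gap stays at one or grows once the failures start.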

Due to some failure recovery loops, these services are running
lvcreate/lvremove/lvrename (on the same VG but different LVs) as often as 5
times per second, which is fast but doesn't seem like it should be a problem.

Looking through past messages to this list, it looks like previous cases were
due to sharing volumes between containers/VMs without a common lock dir, which
we are not doing.

Any thoughts on how to further debug or avoid this issue?

I can provide the lvm metadata backup files if that would help - there are a
lot of them, as once it starts failing, the system retries frequently.

This is on Ubuntu 20.04, with lvm 2.03.07(2)
(ubuntu package version 2.03.07-1ubuntu1)
and a custom kernel built from 5.15.68.

Thanks!
-mike