[linux-lvm] lvremove failing

Thu Sep 5 01:50:29 UTC 2013

Howdy -

I'm trying to get to the bottom of a nasty bug which is affecting our
production servers.

First, the problem. What we observe is that servers eventually fail
during lvremove, like so:

device-mapper: message ioctl on failed: Operation not permitted
 Unable to deactivate open lxc-lxc--pool_tdata (252:1)
 Unable to deactivate open lxc-lxc--pool_tmeta (252:0)
 Failed to deactivate lxc-lxc--pool-tpool
 Failed to resume lxc-pool.
 Failed to update thin pool lxc-pool.

Subsequent lvremove attempts fail ("One or more specified logical
volume(s) not found.") and subsequent attempts to lvcreate new
snapshots with the same origin fail similarly:

device-mapper: message ioctl on failed: Input/output error
 Unable to deactivate open lxc-lxc--pool_tdata (252:1)
 Unable to deactivate open lxc-lxc--pool_tmeta (252:0)
 Failed to deactivate lxc-lxc--pool-tpool
 Failed to resume lxc-pool.

At the same time, we see scary-looking device-mapper and filesystem
errors in syslog:

kernel: [23888.424530] Buffer I/O error on device dm-9, logical block 0
kernel: [23888.443368] attempt to access beyond end of device
kernel: [23888.497838] device-mapper: thin: process_bio:
dm_thin_find_block() failed: error = -5
kernel: [23888.550378] attempt to access beyond end of device

and:

kernel: [24123.428600] attempt to access beyond end of device
kernel: [24123.428843] attempt to access beyond end of device
kernel: [24123.428942] attempt to access beyond end of device
kernel: [24123.440876] attempt to access beyond end of device
kernel: [24123.442232] dm-0: rw=0, want=2150520, limit=491520)
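
As a sanity check on that last line, the numbers can be converted to
sizes. This is a rough sketch that assumes want and limit are counted in
512-byte sectors, as is usual for the kernel block layer;
sectors_to_mib and parse_beyond_eod are helper names made up for this
example.

```shell
#!/bin/sh
# Convert the want=/limit= fields of an "attempt to access beyond end of
# device" detail line into MiB.
# Assumption: both fields are 512-byte sector counts.

sectors_to_mib() {
    # 2048 sectors of 512 bytes make one MiB; integer division rounds down.
    echo $(( $1 / 2048 ))
}

parse_beyond_eod() {
    line=$1
    # Pull out the digits that follow "want=" and "limit=".
    want=$(printf '%s\n' "$line" | sed -n 's/.*want=\([0-9][0-9]*\).*/\1/p')
    limit=$(printf '%s\n' "$line" | sed -n 's/.*limit=\([0-9][0-9]*\).*/\1/p')
    # Fail if either field is missing from the line.
    [ -n "$want" ] && [ -n "$limit" ] || return 1
    echo "want=$(sectors_to_mib "$want")MiB limit=$(sectors_to_mib "$limit")MiB"
}

parse_beyond_eod 'dm-0: rw=0, want=2150520, limit=491520)'
```

For the line above this prints want=1050MiB limit=240MiB. The 240 MiB
limit matches the 240M pool metadata size used in the setup further
down, and dm-0 would normally be minor 0, which is lxc-lxc--pool_tmeta
(252:0) here, so something appears to be reading roughly 1 GiB into a
240 MiB device. That is an inference from the numbers, not something
confirmed.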

I have not (so far) been able to reproduce this problem in isolation,
which is extremely frustrating... I'm hoping someone here will have a
clue what might be going on.

More information: the servers run Ubuntu 13.04 (Linux 3.8.0-29-generic), with these LVM versions:
  LVM version:     2.02.98(2) (2012-10-15)
  Library version: 1.02.77 (2012-10-15)
  Driver version:  4.23.1

We had the same problems with LVM 2.02.95 (the version Ubuntu packages
for raring), so we now build 2.02.98 from source, but the problem
persists.

Also interesting: this problem first appeared when we upgraded from
Ubuntu 12.04 (LVM 2.02.66) to 13.04 (LVM 2.02.95). We haven't changed
the way we create/destroy volumes. (It is plausible that the problem
existed before the upgrade, but with very different symptoms...?)

Speaking of which, here's what we do:

(stuff to make a tmpfs-backed block device in /dev/loop0)

pvcreate /dev/loop0
vgcreate lxc /dev/loop0
lvcreate --extents "99%VG" --poolmetadatasize "240M" --thinpool lxc-pool lxc
lvcreate --name slave-image --virtualsize "20GB" --thin lxc/lxc-pool

(stuff to populate an ext4 filesystem into slave-image)

resize2fs /dev/lxc/slave-image
lvchange --permission r lxc/slave-image

... and then many many many instances of:

sync
lvcreate --name box${n} --snapshot lxc/slave-image
mkdir -p /mnt/box${n}
mount /dev/lxc/box${n} /mnt/box${n} -o noatime

(stuff to start lxc container mounting /mnt/box${n} and run arbitrary
code inside the lxc container... then, some minutes later, shut down
lxc and...)

umount -l /mnt/box${n}
lvremove -f /dev/lxc/box${n}

We do this several thousand times daily across dozens of servers.
About 2-3 times/day, we see the errors I originally described.
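
For anyone who wants to try reproducing this, the create/destroy cycle
above can be sketched as a single script. This is only a sketch under
assumptions: ORIGIN assumes the snapshots are taken of the read-only
slave-image volume, the function names are made up for this example,
and the container start/stop step is left out because it is not spelled
out above (which may well be why the problem does not show up in
isolation). RUN defaults to echo, so by default the commands are only
printed.

```shell
#!/bin/sh
# Sketch of the snapshot create/destroy cycle described above.
# RUN defaults to "echo" (dry run). Set RUN= (empty) to execute the
# commands for real, as root, on a disposable machine only.
# ORIGIN is an assumption: the read-only thin volume populated earlier.
ORIGIN=${ORIGIN:-lxc/slave-image}

box_create() {
    box=$1
    ${RUN-echo} sync &&
    ${RUN-echo} lvcreate --name "box${box}" --snapshot "$ORIGIN" &&
    ${RUN-echo} mkdir -p "/mnt/box${box}" &&
    ${RUN-echo} mount "/dev/lxc/box${box}" "/mnt/box${box}" -o noatime
}

box_destroy() {
    box=$1
    ${RUN-echo} umount -l "/mnt/box${box}" &&
    ${RUN-echo} lvremove -f "/dev/lxc/box${box}"
}

cycle() {
    # Create and destroy boxes 1..count, stopping at the first failure.
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        if ! { box_create "$i" && box_destroy "$i"; }; then
            echo "failed at box${i}" >&2
            return 1
        fi
        i=$((i + 1))
    done
}

cycle "${1:-2}"
```

Run without arguments it prints the commands for two boxes. Running
several copies in parallel with different box numbers would be closer
to what the production servers do.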

So, questions... is this a reasonable place to ask? Any ideas what
might be going wrong, or how I could go about reproducing the issue?
Any glaring flaws in the way we manage the volumes? Any further
information I can provide, or diagnostics I can run, or... well,
anything?
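
As one idea for diagnostics, state could be captured right after a
failed lvremove with something like the sketch below. collect_state is
a made-up helper name and the choice of commands is a guess at what
would be useful, not a list anyone has asked for. RUN defaults to echo,
so by default each file only records the command that would have run.

```shell
#!/bin/sh
# Hypothetical helper: save LVM and device-mapper state into a
# directory, one file per command.
# RUN defaults to "echo" (dry run). Set RUN= (empty) to run for real.
collect_state() {
    out_dir=$1
    mkdir -p "$out_dir" || return 1
    ${RUN-echo} uname -r        > "$out_dir/uname.txt" 2>&1
    ${RUN-echo} lvm version     > "$out_dir/lvm-version.txt" 2>&1
    ${RUN-echo} lvs -a lxc      > "$out_dir/lvs.txt" 2>&1
    ${RUN-echo} dmsetup info -c > "$out_dir/dmsetup-info.txt" 2>&1
    ${RUN-echo} dmsetup table   > "$out_dir/dmsetup-table.txt" 2>&1
    ${RUN-echo} dmsetup status  > "$out_dir/dmsetup-status.txt" 2>&1
    ${RUN-echo} dmesg           > "$out_dir/dmesg.txt" 2>&1
    # Print where the files went so the caller can attach them.
    echo "$out_dir"
}

collect_state "${1:-$(mktemp -d)}"
```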

Thanks,
David Lowe