[linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg

Tue Sep 25 16:44:06 UTC 2018

On Tue, Sep 25, 2018 at 06:18:53PM +0800, Damon Wang wrote:
> Hi,
> 
>   AFAIK once sanlock cannot access lease storage, it will issue
> "kill_vg" to lvmlockd, and the standard procedure is to deactivate
> the logical volumes and drop the VG locks.
> 
>   But sometimes the storage recovers after kill_vg (and before we
> deactivate or drop the lock), and then LVM commands print "storage
> failed for sanlock leases", like this:
> 
> [root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
>   WARNING: Not using lvmetad because config setting use_lvmetad=0.
>   WARNING: To avoid corruption, rescan devices to make changes visible
> (pvscan --cache).
>   VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for
> sanlock leases
>   Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.
> 
>   So what should I do to recover from this, preferably without
> affecting volumes that are in use?
> 
>   I found a way, but it seems very tricky: save the "lvmlockctl -i"
> output, run "lvmlockctl -r vg", and then activate the volumes listed in
> the saved output.
> 
>   Do we have an "official" way to handle this? It is pretty common
> that by the time I notice lvmlockd has failed, the storage has already
> recovered.

Hi, to figure out that workaround, you've probably already read the
section of the lvmlockd man page: "sanlock lease storage failure", which
gives some background about what's happening and why.  What the man page
is missing is some help about false failure detections like you're seeing.

It sounds like io delays from your storage are a little longer than
sanlock is allowing for.  With the default 10 sec io timeout, sanlock will
initiate recovery (kill_vg in lvmlockd) after 80 seconds of no successful
io from the storage.  After this, it decides the storage has failed.  If
it's not failed, just slow, then the proper way to handle that is to
increase the timeouts.  (Or perhaps try to configure the storage to avoid
such lengthy delays.)  Once a failure is detected and recovery has begun,
there's no official way to back out of it.

You can increase the sanlock io timeout with lvmlockd -o <seconds>.
sanlock multiplies that by 8 to get the total length of time before
starting recovery.  I'd look at how long your temporary storage outages
last and set io_timeout so that 8*io_timeout will cover it.
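
As a rough illustration of that arithmetic, here is a minimal sketch in
Python.  The function names and the 120-second outage are made up for the
example; the only facts it relies on are the ones above (recovery starts
after 8*io_timeout, and the default io_timeout is 10 seconds).

```python
import math

# From the explanation above: sanlock starts recovery after
# 8 * io_timeout seconds without successful io.
SANLOCK_TIMEOUT_MULTIPLIER = 8
DEFAULT_IO_TIMEOUT = 10  # seconds, the default mentioned above


def recovery_delay(io_timeout):
    """Seconds of failed io before recovery (kill_vg) begins."""
    return SANLOCK_TIMEOUT_MULTIPLIER * io_timeout


def min_io_timeout(longest_outage_seconds):
    """Smallest whole-second io_timeout whose recovery delay covers the
    outage, never going below the default.  (Hypothetical helper.)"""
    needed = math.ceil(longest_outage_seconds / SANLOCK_TIMEOUT_MULTIPLIER)
    return max(needed, DEFAULT_IO_TIMEOUT)


if __name__ == "__main__":
    outage = 120  # assumed longest temporary outage, in seconds
    timeout = min_io_timeout(outage)
    print(f"lvmlockd -o {timeout}  "
          f"(recovery after {recovery_delay(timeout)} seconds)")
```

In practice you would probably want some headroom on top of the computed
value rather than the bare minimum, since an outage that lasts exactly
8*io_timeout would still trigger recovery.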

Dave