[linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg

Damon Wang damon.devops at gmail.com
Fri Sep 28 03:14:35 UTC 2018


On Fri, Sep 28, 2018 at 1:35 AM David Teigland <teigland at redhat.com> wrote:
>
> On Thu, Sep 27, 2018 at 10:12:44PM +0800, Damon Wang wrote:
> > Thank you for your reply, I have another question under such
> > circumstances.
> >
> > I usually run "vgck" to check whether the vg is good, but sometimes it
> > seems to get stuck and leaves a VGLK held in sanlock. (I'm sure io
> > errors will cause this, but sometimes it happens without an io error.)
> > Then I'll try "sanlock client release -r xxx" to release it, but that
> > also sometimes doesn't work (gets stuck).
> > Then I may run "lvmlockctl -r" to drop the vg lockspace, but that can
> > still get stuck, and I'm sure the io is ok when it's stuck.
> >
> > This usually happens on multipath storage. I suspect multipath queueing
> > io is to blame, but I'm not sure.
> >
> > Any idea?
>
> First, you might be able to avoid this issue by doing the check using
> something other than an lvm command, or perhaps an lvm command configured
> to avoid taking locks (the --nolocking option in vgs/pvs/lvs).  What's
> appropriate depends on specifically what you want to know from the check.
>

This is how I use sanlock and lvmlockd:


 +------------------+          +------------------+          +----------------+
 |                  |          |                  |          |                |
 |     sanlock      <---------->     lvmlockd     <----------+  lvm commands  |
 |                  |          |                  |          |                |
 +------------------+          +------------------+          +----------------+
        |
        |
        |      +------------------+       +-----------------+        +------------+
        |      |                  |       |                 |        |            |
        +------>    multipath     <- - - -|   lvm volumes   <--------+    qemu    |
               |                  |       |                 |        |            |
               +------------------+       +-----------------+        +------------+
                       |
                       |
                       |
               +------------------+
               |                  |
               |   san storage    |
               |                  |
               +------------------+

As I mentioned in my first mail, I sometimes find lvm commands failing with
"sanlock lease storage failure". I guess this is because lvmlockd kill_vg
has been triggered; as the manual says, I should then deactivate the volumes
and drop the lockspace as quickly as possible, but I have no good
programmatic way to get an alert when this happens.

A message does show up on the TTY, but that is not a good thing to listen to
or monitor, so I run vgck periodically and parse its stdout and stderr; once
"sanlock lease storage failure" or anything else unusual shows up, an alert
is triggered and I do some checks (I hope the whole process can eventually
be automated).
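Roughly, the check looks like the sketch below (the VG name, the interval
and the send_alert command are placeholders I made up for illustration; the
real script feeds our monitoring instead):

#!/bin/sh
# periodic vgck watchdog -- "bigvg" and "send_alert" are placeholders
VG=bigvg

while true; do
    # bound the runtime so a stuck vgck doesn't pile up; 30s is an arbitrary choice
    out=$(timeout 30 vgck "$VG" 2>&1)
    rc=$?
    if [ "$rc" -eq 124 ]; then
        send_alert "vgck on $VG timed out -- possible stuck VGLK"
    elif echo "$out" | grep -q "sanlock lease storage failure"; then
        send_alert "sanlock lease storage failure on $VG: $out"
    elif [ "$rc" -ne 0 ]; then
        send_alert "vgck on $VG failed (rc=$rc): $out"
    fi
    sleep 60
done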

If I don't take the lock (pvs/lvs/vgs --nolocking), these errors won't be
noticed, and since a lot of san storage configures multipath to queue io for
as long as possible (multipath -t | grep queue_if_no_path), getting an lvm
error early is pretty difficult. After trying various approaches, running
vgck and parsing its output turned out to be the way with the least load (it
only takes a shared VGLK) and the best efficiency (it usually takes less
than 0.1s).

As you mentioned, I'll extend the io timeout to ride out storage jitter, and
I believe that will also resolve some of the problems caused by multipath
queueing io.
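If I read the lvmlockd man page right, the knob for that is its sanlock
timeout option, so something like the line below (the 40s value is only an
example I picked, not a tested recommendation):

# start lvmlockd with a larger sanlock io timeout (the default is 10s);
# 40 is just an example value
lvmlockd --sanlock-timeout 40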


> I still haven't fixed the issue you found earlier, which sounds like it
> could be the same or related to what you're describing now.
> https://www.redhat.com/archives/linux-lvm/2018-July/msg00011.html
>
> As for manually cleaning up a stray lock using sanlock client, there may
> be some limits on the situations that works in, I don't recall off hand.
> You should try using the -p <pid> option with client release to match the
> pid of lvmlockd.


Yes, I added -p when releasing the lock, and I want to write up an
"Emergency Procedures" summary for dealing with different storage failures;
for me it's still unclear right now. I'll do more experiments after fixing
these annoying storage failures and then put that summary together.
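For now, the rough manual sequence I follow looks like the sketch below (the
VG name is a placeholder, and the resource string has to be copied from the
sanlock status output rather than typed from memory):

#!/bin/sh
# rough manual cleanup after a stuck VGLK -- "bigvg" is a placeholder VG name
VG=bigvg

# 1. list the lockspaces and resources sanlock still holds
sanlock client status

# 2. release the stray VGLK on behalf of lvmlockd, pasting the resource
#    string exactly as printed by the status output above
sanlock client release -r <resource_from_status> -p "$(pidof lvmlockd)"

# 3. if the VG is unusable anyway, drop its lockspace from lvmlockd
lvmlockctl --drop "$VG"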


> Configuring multipath to fail more quickly instead of queueing might give
> you a better chance of cleaning things up.
>
> Dave
>

Yeah, I believe multipath queueing io is to blame. I'm negotiating with the
storage vendor, since they think the multipath config is right :-(
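Following your suggestion to fail faster instead of queueing, the change I'm
proposing to them is something like this multipath.conf override (the
vendor/product strings and the retry count are placeholders, not their real
values):

devices {
    device {
        vendor  "EXAMPLE"
        product "EXAMPLE-LUN"
        # bound the queueing: retry for 12 polling intervals, then fail the
        # io back to the caller instead of queue_if_no_path forever
        no_path_retry 12
    }
}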


Thank you for your patience!

Damon