[linux-lvm] clvmd on cman waits forever holding the P_#global lock on node re-join
zkabelac at redhat.com
Thu Dec 13 10:04:40 UTC 2012
On 13.12.2012 00:14, Dmitry Panov wrote:
> Hi everyone,
> I've been testing clvm recently and noticed that operations are often
> blocked when a node rejoins the cluster after being fenced or power cycled.
> I've done some investigation and found a number of issues related to clvmd.
> Here is what's happening:
> - When a node is fenced there is no "port closed" message sent to clvm which
> means the node id remains in the updown hash, although the node itself is
> removed from the nodes list after a "configuration changed" message is received.
> - Then, when the node rejoins, another "configuration changed" message arrives,
> but because the node id is still in the hash, it is assumed that clvmd on that
> node is running, even though that may not yet be the case (in my case clvmd is
> a pacemaker resource, so it takes a couple of seconds before it's started).
> - This causes the expected_replies count to be set higher than it should
> be, and as a result there are never enough replies received.
> - There is a problem with the handling of cmd_timeout which appears to have
> been fixed today (what a coincidence!) by this patch:
> https://www.redhat.com/archives/lvm-devel/2012-December/msg00024.html The
> reason I was hitting this bug is that I'm using Linux Cluster
> Management Console, which polls LVM often enough that the timeout code never
> ran. I had fixed this independently, and even though my efforts are now
> probably wasted, I'm attaching a patch for your consideration. I believe it
> enforces the timeout more strictly.
> Now, the questions:
> 1. If the problem with the stuck entry in the updown hash is fixed, it will
> cause operations to fail until clvmd is started on the re-joined node. Is
> there any particular reason to make them fail? Is it to avoid a race condition
> where a newly started clvmd might not receive a message generated by an 'old'
> node?
> 2. The current expected_replies counter seems a bit flawed to me because it
> will fail if a node leaves the cluster before it sends a reply. Should this be
> handled differently? For example, instead of a simple counter we could keep a
> list of nodes which is updated whenever a node leaves the cluster.
Hmmm, this rather looks like a logic problem: either the if() expression in
the (select_status == 0) branch is wrong, or the somewhat 'magical' gulm fix
applied in 2005 to add_to_lvmqueue() should run not only when a message
arrives.
Both patches seem not to fix the bug itself, but rather to work around the
broken logic in the main loop - it will need some thinking.
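For reference, the "stricter" timeout behaviour Dmitry describes (where frequent LCMC polling must not keep the timeout code from ever running) generally means checking an absolute deadline on every pass of the select() loop instead of restarting a relative timeout whenever traffic arrives. The sketch below is a generic illustration of that pattern under my own assumptions, not clvmd's actual main loop:

```c
/* Generic sketch of deadline-based timeout handling in a select() loop;
 * NOT clvmd's actual code. The timeout is measured against an absolute
 * deadline, so a steady stream of incoming messages (e.g. a GUI polling
 * LVM) cannot push the timeout forward indefinitely. */
#include <sys/select.h>
#include <time.h>

/* Returns 0 on timeout, >0 when fd becomes readable, <0 on error. */
int wait_with_deadline(int fd, time_t deadline)
{
    for (;;) {
        time_t now = time(NULL);
        if (now >= deadline)
            return 0;                    /* deadline passed: time out */

        /* Wait only for the time remaining, not a full fresh interval. */
        struct timeval tv = { .tv_sec = deadline - now, .tv_usec = 0 };
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (rc != 0)
            return rc;                   /* data arrived, or an error */
        /* rc == 0: select() expired; the loop re-checks the deadline */
    }
}
```

A loop that instead resets a relative timeout on every wakeup would never reach its timeout branch under constant polling, which matches the symptom described in the original mail.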