[Linux-cluster] CLVM/GFS will not mount or communicate with cluster
Patrick Caulfield
pcaulfie at redhat.com
Tue Dec 5 14:34:31 UTC 2006
Barry Brimer wrote:
>
>
> On Mon, 4 Dec 2006, Robert Peterson wrote:
>
>> Barry Brimer wrote:
>>> This is a repeat of the post I made a few minutes ago. I thought
>>> adding a
>>> subject would be helpful.
>>>
>>>
>>> I have a 2 node cluster for a shared GFS filesystem. One of the
>>> nodes fenced
>>> the other, and the node that got fenced is no longer able to
>>> communicate with
>>> the cluster.
>>>
>>> While booting the problem node, I receive the following error message:
>>> Setting up Logical Volume Management: Locking inactive: ignoring
>>> clustered
>>> volume group vg00
>>>
>>> I have compared /etc/lvm/lvm.conf files on both nodes. They are
>>> identical. The
>>> disk (/dev/sda1) is listed when typing "fdisk -l"
>>>
>>> There are no iptables firewalls active (although
>>> /etc/sysconfig/iptables exists,
>>> iptables is chkconfig'd off). I have written a simple iptables
>>> logging rule
>>> (iptables -I INPUT -s <problem node> -j LOG) on the working node to
>>> verify that
>>> packets are reaching the working node, but no messages are being
>>> logged in
>>> /var/log/messages on the working node that acknowledge any cluster
>>> activity
>>> from the problem node.
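[A sketch of the logging technique Barry describes, with hypothetical addresses: the iptables LOG target writes each matched packet to the kernel log, which syslog delivers to /var/log/messages. A cman membership packet on this release travels over UDP port 6809, so extracting the DPT field from a logged line confirms cluster traffic is reaching the node.]

```shell
# Rule on the working node (run as root; 192.168.1.2 stands in for the
# problem node's address -- both addresses here are hypothetical):
#   iptables -I INPUT -s 192.168.1.2 -j LOG --log-prefix "from-node2: "
#
# A matched cman packet then appears in /var/log/messages roughly like the
# sample line below; extracting DPT shows the destination port reached.
sample='Dec  4 10:00:01 node1 kernel: from-node2: IN=eth0 OUT= SRC=192.168.1.2 DST=192.168.1.1 PROTO=UDP SPT=6809 DPT=6809'
echo "$sample" | grep -o 'DPT=[0-9]*'
```

Seeing such lines proves only that packets arrive at the working node's interface; whether cman processes them is a separate question, which is exactly the gap Barry observes.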
>>>
>>> Both machines have the same RH packages installed and are mostly up
>>> to date,
>>> they are missing the same packages, none of which involve the kernel,
>>> RHCS or
>>> GFS.
>>>
>>> When I boot the problem node, it successfully starts ccsd, but it
>>> fails after a
>>> while on cman and fails after a while on fenced. I have given the clvmd
>>> process an hour, and it still will not start.
>>>
>>> vgchange -ay on the problem node returns:
>>>
>>> # vgchange -ay
>>> connect() failed on local socket: Connection refused
>>> Locking type 2 initialisation failed.
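[The "Locking type 2 initialisation failed" error means LVM is configured for clustered locking in /etc/lvm/lvm.conf and is trying to reach clvmd over a local socket; the "Connection refused" comes from clvmd not running. A minimal sketch of what the configuration looks like, using a simulated sample file rather than the node's real lvm.conf:]

```shell
# Simulated /etc/lvm/lvm.conf fragment: locking_type = 2 selects the
# external (clustered) locking library, which talks to the clvmd daemon
# over a local socket.  If clvmd is down, every LVM command that needs a
# cluster lock fails with "connect() failed on local socket".
cat > /tmp/lvm.conf.sample <<'EOF'
global {
    locking_type = 2
    locking_library = "liblvm2clusterlock.so"
}
EOF
grep -E 'locking_type *= *[23]' /tmp/lvm.conf.sample && echo "clustered locking configured"
```

Since clvmd in turn refuses to start until cman has joined the cluster, this error is a downstream symptom of the cman failure rather than an LVM problem in its own right.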
>>>
>>> I have the contents of /var/log/messages on the working node and the
>>> problem
>>> node at the time of the fence, if that would be helpful.
>>>
>>> Any help is greatly appreciated.
>>>
>>> Thanks,
>>> Barry
>>>
>> Hi Barry,
>>
>> Well, vgchange and other lvm functions won't work on the clustered volume
>> unless clvmd is running, and clvmd won't run properly until the node
>> is talking
>> happily through the cluster infrastructure. So as I see it, your
>> problem is that
>> cman is not starting properly. Unfortunately, you haven't told us
>> enough about the system for us to determine why. There can be many
>> reasons.
>
> Agreed. Although it did not seem relevant at the time of the post,
> there were network outages around the time of the failure. What happens
> now is that on the problem node, ccsd starts, but when cman starts it
> sends membership requests that are never acknowledged by the working
> node. Again, my iptables logging rule shows packets arriving on UDP 6809
> from the problem node in /var/log/messages on the working node, but cman
> never acknowledges them.
>
> The problem node had this in its /var/log/messages at the time of the
> problem:
>
> Dec 1 14:29:38 server1 kernel: CMAN: Being told to leave the cluster by
> node 1
> Dec 1 14:29:38 server1 kernel: CMAN: we are leaving the cluster.
If you're running the cman from RHEL4 Update 3, there's a bug in it that you might be hitting.
You'll need to upgrade all the nodes in the cluster to get rid of it. I can't tell for sure whether
that is the problem you're having without seeing more kernel messages, though.
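[One way to act on this advice is to compare the installed cluster packages across nodes with `rpm -q cman cman-kernel`. The version strings in the sketch below are made up for illustration, as is the version of the fixed release; a version-aware sort shows whether an upgrade is needed:]

```shell
# Hypothetical comparison of an installed cman version against the release
# carrying the fix (both version strings are invented for this sketch):
installed="1.0.4"      # e.g. from: rpm -q --qf '%{VERSION}\n' cman-kernel
fixed="1.0.11"
oldest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "upgrade needed"
fi
```

A plain string sort would wrongly rank 1.0.11 before 1.0.4, which is why the version-aware `sort -V` is used.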
--
patrick