[Linux-cluster] gfs2, kvm setup

Wed Jul 9 15:40:04 UTC 2008

On Wed, Jul 09, 2008 at 09:51:02AM +0100, Christine Caulfield wrote:
> Steven Whitehouse wrote:
>> Hi,
>>
>> On Tue, 2008-07-08 at 18:15 -0400, J. Bruce Fields wrote:
>>> On Mon, Jul 07, 2008 at 02:49:28PM -0400, bfields wrote:
>>>> On Mon, Jul 07, 2008 at 10:48:28AM -0500, David Teigland wrote:
>>>>> On Sun, Jul 06, 2008 at 05:51:05PM -0400, J. Bruce Fields wrote:
>>>>>> -	write(control_fd, in, sizeof(struct gdlm_plock_info));
>>>>>> +	write(control_fd, in, sizeof(struct dlm_plock_info));
>>>>> Gah, sorry, I keep fixing that and it keeps reappearing.
>>>>>
>>>>>
>>>>>> Jul  1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
>>>>>> It looks like dlm_new_workspace() is waiting on dlm_recoverd, which is
>>>>>> in "D" state in dlm_rcom_status(), so I guess the second node isn't
>>>>>> getting some dlm reply it expects?
>>>>> dlm inter-node communication is not working here for some reason.  There
>>>>> must be something unusual with the way the network is configured on the
>>>>> nodes, and/or a problem with the way the cluster code is applying the
>>>>> network config to the dlm.
>>>>>
>>>>> Ah, I just remembered what this sounds like; we see this kind of thing
>>>>> when a network interface has multiple IP addresses, and/or routing is
>>>>> configured strangely.  Others cc'ed could offer better details on exactly
>>>>> what to look for.
>>>> OK, thanks!  I'm trying to run gfs2 on 4 kvm machines, I'm an expert on
>>>> neither, and it's entirely likely there's some obvious misconfiguration.
>>>> On the kvm host there are 4 virtual interfaces bridged together:
>>> I ran wireshark on vnet0 while doing the second mount; what I saw was
>>> the second machine opened a tcp connection to port 21064 on the first
>>> (which had already completed the mount), and sent it a single message
>>> identified by wireshark as "DLM3" protocol, type recovery command:
>>> status command.  It got back an ACK then a RST.
>>>
>>> Then the same happened in the other direction, with the first machine
>>> sending a similar message to port 21064 on the second, which then reset
>>> the connection.
>>>
>
> That's a symptom of the "connect from non-cluster node" error in the DLM.

I think I am getting a message to that effect in my logs.

> It's got a connection from an IP address that is not known to cman.
> So it closes it as a spoofer.

OK.  Is there an easy way to see the list of IP addresses known to cman?
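
To make the rejection described above concrete, here is a minimal sketch
of that kind of check: look up the peer address of an incoming connection
in the list of addresses the cluster knows about, and treat anything else
as a non-cluster node. This is not the actual DLM code. The function name
is_known_node() and the addresses in the usage note are hypothetical, and
the sketch assumes IPv4 only.

```c
#include <arpa/inet.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the DLM sources.
 * Returns 1 if peer matches one of the known node addresses,
 * 0 if it matches none of them, and -1 if peer is not a valid
 * IPv4 address. Unparseable entries in known[] are skipped.
 */
int is_known_node(const char *peer, const char *const known[], size_t n)
{
	struct in_addr p, k;
	size_t i;

	if (inet_pton(AF_INET, peer, &p) != 1)
		return -1;

	for (i = 0; i < n; i++) {
		/* Compare the binary forms so "10.0.0.5" style strings
		 * are matched by value, not by spelling. */
		if (inet_pton(AF_INET, known[i], &k) == 1 &&
		    k.s_addr == p.s_addr)
			return 1;
	}
	return 0;
}
```

With a made-up node list of 192.168.122.11 and 192.168.122.12, a
connection arriving from 10.0.0.5 would return 0 here, which corresponds
to the case where the connection gets closed.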

> You'll need to check the routing of the interfaces. The most common
> cause of this sort of error is having two interfaces on the same
> physical (or internal) network.

Thanks, that's helpful.
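
One way to test for the "two interfaces on the same network" case is to
compare the network parts of the two interface addresses. The sketch
below assumes IPv4 and that both interfaces use the same prefix length.
The function name same_subnet() and the addresses in the usage note are
made up for illustration and are not from any of the cluster tools.

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if both dotted-quad IPv4 addresses fall in the same
 * subnet for the given prefix length, 0 if they do not, and -1 if
 * either address or the prefix length is invalid.
 */
int same_subnet(const char *a, const char *b, int prefix_len)
{
	struct in_addr ia, ib;
	uint32_t mask;

	if (prefix_len < 0 || prefix_len > 32)
		return -1;
	if (inet_pton(AF_INET, a, &ia) != 1 ||
	    inet_pton(AF_INET, b, &ib) != 1)
		return -1;

	/* Shifting a 32-bit value by 32 is undefined in C, so the
	 * /0 case gets an explicit all-zero mask. */
	mask = prefix_len == 0 ? 0 : htonl(0xffffffffu << (32 - prefix_len));

	/* s_addr is in network byte order, and so is mask. */
	return (ia.s_addr & mask) == (ib.s_addr & mask);
}
```

For example, same_subnet("192.168.122.10", "192.168.122.20", 24) returns
1, so two interfaces carrying those addresses would be candidates for
the problem described above, while "192.168.122.10" and "192.168.123.20"
with a /24 prefix return 0.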

--b.