[Linux-cluster] Slowness above 500 RRDs

Tue Jun 12 16:39:34 UTC 2007

David Teigland <teigland at redhat.com> writes:

> On Tue, Jun 12, 2007 at 05:06:56PM +0200, Ferenc Wagner wrote:
> 
>> Here is the old mail I haven't sent before.  Meanwhile, I'm switching
>> in other nodes to continue the tests in my previous mail.

[...]

>> But looks like nodeA feels obliged to communicate its locking
>> process around the cluster.
>
> I'm not sure what you mean here.  To see the amount of dlm locking traffic
> on the network, look at port 21064.  There should be very little in the
> test above... and the dlm locking that you do see should mostly be related
> to file i/o, not flocks.

There was a lot of traffic on port 21064.  It may well be related to
file I/O rather than flocks, I can't tell.  But that agrees with my
speculation that it's not the explicit [pf]locks that take much time,
but something else.
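
To put numbers on this, the captured packets could be tallied per
category.  The following is only a sketch: it assumes the capture has
already been reduced to (dst_ip, src_port, dst_port, length) tuples,
which is a made-up format, and the function names are hypothetical.
Anything touching port 21064 is counted as dlm traffic, and anything
sent to a multicast address is counted separately.

```python
from collections import Counter
from ipaddress import ip_address

DLM_PORT = 21064  # the dlm port mentioned in the quoted reply


def classify(dst_ip, src_port, dst_port):
    """Put one packet into a rough category (assumed rules, see above)."""
    if ip_address(dst_ip).is_multicast:
        return "multicast"
    if DLM_PORT in (src_port, dst_port):
        return "dlm"
    return "other"


def tally(packets):
    """Sum payload lengths per category.

    `packets` is an iterable of (dst_ip, src_port, dst_port, length)
    tuples; this record layout is an assumption for illustration.
    """
    totals = Counter()
    for dst_ip, src_port, dst_port, length in packets:
        totals[classify(dst_ip, src_port, dst_port)] += length
    return totals
```

Comparing the "dlm" total between a run with a single node and a run
with several nodes would show whether the extra traffic really is on
the locking port.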

>> What confuses me is that he emits multicast packets even when he's the
>> only member.  Otherwise, it passes tokens around the cluster, which
>> makes more sense, though still unnecessary, as he is the lock master (if
>> I get the lock master concept right).
>
> I think you're confusing the multicast network traffic from openais
> (related to cluster membership) and the point-to-point network traffic
> from the dlm (related to gfs locking).  The two types of traffic are not
> related.

I didn't notice any multicast traffic when the node wasn't alone, but
maybe it was simply dwarfed by the locking traffic.  I can check that
again later, but...

>> # cman_tool services
>> type             level name     id       state       
>> fence            0     default  00010001 none        
>> [1 2 3]
>> dlm              1     clvmd    00020001 none        
>> [1 2 3]
>> dlm              1     test     000a0001 none        
>> [1 2]
>> gfs              2     test     00090001 none        
>> [1 2]
>
> !?!? but now you're using the old RHEL4 generation stuff -- gfs_controld
> is completely irrelevant there.  The analysis completely changes between
> the RHEL4/RHEL5 (old/new) generations of infrastructure.

To the best of my knowledge, I'm using the new infrastructure.  There's
no cman kernel module loaded, there's no cman process running, there's
an aisexec process running, and syslog contains messages like

openais[4374]: [CLM  ] CLM CONFIGURATION CHANGE 
openais[4374]: [CLM  ] New Configuration: 
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] Members Left: 
openais[4374]: [CLM  ] Members Joined: 
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [SYNC ] This node is within the primary component and will provide service. 
openais[4374]: [TOTEM] entering OPERATIONAL state. 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CPG  ] got joinlist message from node 1 
openais[4374]: [CPG  ] got joinlist message from node 2 
kernel: dlm: connecting to 3
kernel: dlm: got connection from 3

lsmod gives:

gfs                   256964  1 
lock_nolock             4480  0 
lock_dlm               20684  2 
gfs2                  328076  3 gfs,lock_nolock,lock_dlm
dlm                    92340  17 lock_dlm
configfs               25616  2 dlm
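
The checks listed above can be written down as a small script.  This
is only a sketch of that reasoning, not an official test: the function
names are made up, and the rule "no cman module, no cman process,
aisexec running" is an assumption about what distinguishes the new
infrastructure from the old one.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names found in lsmod-style output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # A module line has a numeric size in the second column; this
        # also skips blank lines and the "Module Size Used by" header.
        if len(fields) >= 2 and fields[1].isdigit():
            names.add(fields[0])
    return names


def looks_like_new_stack(lsmod_text, process_names):
    """Heuristic only: True if the assumed signs of the new stack hold.

    `process_names` is a collection of running process names, however
    they were gathered (for example from ps output).
    """
    modules = loaded_modules(lsmod_text)
    return (
        "cman" not in modules
        and "cman" not in process_names
        and "aisexec" in process_names
    )
```

Fed the lsmod output above and a process list containing aisexec,
looks_like_new_stack() returns True under these assumptions.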

How could I be running the old stuff?  Am I totally confused?
-- 
Feri.