[Linux-cluster] dlm: message size from 3 too big

James Chamberlain jamesc at exa.com
Mon May 24 14:02:31 UTC 2010


LM,

That's interesting to note.  I know that there have been some issues  
with that particular Ethernet switch, but I was assured they had been  
resolved.  That gives me somewhere to start looking, at any rate.

Thanks,

James

On May 20, 2010, at 11:09 PM, lm chen wrote:

> Hi,
>
>    I've been tied up with gfs2, so I haven't had much time for the gfs code.
>
>    It looks like this was caused by a network break? The message then
> seems to have piled up to more than dlm_config.buffer_size:
>
>
>                 if (msglen > dlm_config.buffer_size) {
>                         printk("dlm: message size from %d too big %d(pkt len=%d)\n",
>                                nodeid, msglen, len);
>                         khexdump((const unsigned char *) msg, len);
>                         break;
>                 }
>
> It would help if someone could take a look at the lowcomms code to see
> how its flow control works on the sending side (taking
> dlm_config.buffer_size into account).
>
>
> /**
>  * Check status of a cluster service
>  *
>  * @param svcName       Service name to check.
>  * @return              FORWARD, FAIL, 0
>  */
> int
> svc_status(char *svcName)
> {
>         void *lockp = NULL;
>         rg_state_t svcStatus;
>
>         if (rg_lock(svcName, &lockp) < 0) {
>                 clulog(LOG_ERR, "#48: Unable to obtain cluster lock: %s\n",
>                        strerror(errno));
>                 return FAIL;
>         }
>
>
>
> 2010/5/21 James Chamberlain <jamesc at exa.com>
> Hi all,
>
> I've got a three node cluster running CentOS 4.8, GFS-6.1.19-1.el4_8  
> (GFS 1 filesystems), kernel 2.6.9-89.0.19.ELsmp.  I've seen messages  
> like those below a couple times in the last couple weeks.  Node 3  
> doesn't go down, so it doesn't get fenced; but DLM is unable to  
> negotiate locks, so the load average on each node spikes and the  
> cluster can't serve anything out through NFS.  Has anyone seen  
> anything like this? Any idea what to do about it?  Shooting node 3  
> in the head has caused the cluster to recover, but I'd like to know  
> how to fix it rather than work around it.
>
> Thanks,
>
> James
>
> [[Operating normally prior to this point]]
> May 20 04:52:50 s12n01 clurgmgrd[7467]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 04:53:41 s12n03 clurgmgrd[7476]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 04:54:09 s12n03 kernel: dlm: message size from 3 too big 34560(pkt len=386)
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 87-00 00 00 23  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 30 b0 d2 84 02 01 00 00-30 b0 d2 84  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: 8e 64 13 80 ff ff ff ff-f0 7d ee 81  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: f0 7d ee 81 02 01 00 00-02 02 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: a4 de eb 81 02 01 00 00-00 40 01 00  
> 00 01 00 00
> May 20 04:54:09 s12n03 kernel: b8 7c ee 81 02 01 00 00-ff ff ff ff  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: 82 01 00 00 00 00 00 00-90 de eb 81  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: 00 10 00 00 00 00 00 00-00 00 00 00  
> 00 01 00 00
> May 20 04:54:09 s12n03 kernel: b7 6d db b6 6d db b6 6d-be cb 36 a0  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: 60 fa 06 01 00 01 00 00-82 01 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 82 d1 0d 2a 00 01 00 00-7e 0e 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 d0 0d 2a 00 01 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 03 00 00 00
> May 20 04:54:09 s12n03 kernel: 68 7e ee 81 02 01 00 00-02 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-90 de eb 81  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: 01 00 00 00 00 00 00 00-10 50 38 a0  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: fc ff ff ff 00 00 00 00-98 3c eb 81  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: b0 c9 14 80 ff ff ff ff-b2 d1 36 a0  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-91 d0 36 a0  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: a8 3c eb 81 02 01 00 00-87 c9 14 80  
> ff ff ff ff
> May 20 04:54:09 s12n03 kernel: ff ff ff ff ff ff ff ff-98 3c eb 81  
> 02 01 00 00
> May 20 04:54:09 s12n03 kernel: 30 3c eb 81 02 01 00 00-c0 86 f2 af  
> 00 01 00 00
> May 20 04:54:09 s12n03 kernel: 12
> May 20 04:54:09 s12n03 kernel: 02
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: bad header version 0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: cmd=0, flags=0, length=1024, lkid=1711276032, lockspace=0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: base=000001002a0dd000, offset=1024, len=810, ret=1024, limit=00001000 newbuf=0
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 04-00 00 00 66  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 01 00 01 00
> May 20 04:54:09 s12n03 kernel: 03 00 72 00 c0 00 a4 23-17 00 00 01  
> 6a 01 c9 26
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 84 34 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 ff 03 01 16-19 70 00 00  
> ff 52 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 a6 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 01 00
> May 20 04:54:09 s12n03 kernel: 01 00 03 00 72 00 3d 00-7a 2a 17 00  
> 00 01 13 00
> May 20 04:54:09 s12n03 kernel: 42 2c 00 00 00 00 08 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 84 34
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 ff 03-01 16 19 70  
> 00 00 8c 0a
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 37  
> 00 00 00 53
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 01 00 01 00 03 00 72 00-49 03 b5 26  
> 17 00 00 01
> May 20 04:54:09 s12n03 kernel: 56 01 69 26 00 00 00 00-08 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 84 34 00 00 00 00 00 00-ff 03 01 16  
> 19 70 00 00
> May 20 04:54:09 s12n03 kernel: dc fc 00 00 00 00 00 00-00 00 00 00  
> 00 0f 00 00
> May 20 04:54:09 s12n03 kernel: 00 6d 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 01 00 01 00 03 00-72 00 75 00  
> b7 26 17 00
> May 20 04:54:09 s12n03 kernel: 00 01 93 02 4f 26 00 00-00 00 08 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 84 34 00 00 00 00-00 00 ff 03  
> 01 16 19 70
> May 20 04:54:09 s12n03 kernel: 00 00 62 a7 00 00 00 01-00 00 00 00  
> 00 00 00 50
> May 20 04:54:09 s12n03 kernel: 00 00 00 76 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 01 00 01 00-03 00 72 00  
> 8e 02 2d 26
> May 20 04:54:09 s12n03 kernel: 17 00 00 01 81 03 85 27-00 00 00 00  
> 08 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 84 34 00 00-00 00 00 00  
> ff 03 01 16
> May 20 04:54:09 s12n03 kernel: 19 70 00 00 6c a4 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 44 00 00 00 49 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 01 00-01 00 03 00  
> 72 00 5b 00
> May 20 04:54:09 s12n03 kernel: f0 21 17 00 00 01 3a 02-fb 2b 00 00  
> 00 00 08 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 84 34-00 00 00 00  
> 00 00 ff 03
> May 20 04:54:09 s12n03 kernel: 01 16 19 70 00 00 ff 74-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 83-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-01 00 01 00  
> 03 00 72 00
> May 20 04:54:10 s12n03 kernel: 9a 02 18 23 17 00 00 01-b9 02 f5 2d  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 08 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-84 34 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16 19 70 00 00-ff 83 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 75 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 01 00  
> 01 00 03 00
> May 20 04:54:10 s12n03 kernel: 72 00 1b 01 86 2a 17 00-00 01 56 02  
> d8 28 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 08 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 84 34  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 ff 03 01 16 19 70-00 00 fe 82  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 64  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 01 00 01 00
> May 20 04:54:10 s12n03 kernel: 03 00 72 00 c6 01 fc 27-17 00 00 01  
> e0 00 f6 28
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00  
> 00 00 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00  
> 84 34 00 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16
> May 20 04:54:10 s12n03 kernel: 19 70 00 00
> May 20 04:54:10 s12n03 kernel: f7
> May 20 04:54:10 s12n03 kernel: a3
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: dlm: lowcomms: addr=000001002a0dd000, base=0, len=1834, iov_len=3710, iov_base[0]=000001002a0dd72a, read=1448
> May 20 04:54:50 s12n01 clurgmgrd[7467]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 04:56:41 s12n03 clurgmgrd[7476]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 05:02:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 05:05:13 s12n02 clurgmgrd[7527]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 05:08:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
> [...]
>
> When I say the load spikes, this is what I mean:
>
> Linux 2.6.9-89.0.19.ELsmp (s12n01)      05/20/2010
>
> 12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
> [...]
> 04:00:01 AM         0       820      0.18      1.84      3.66
> 04:10:01 AM         0       820      2.72      4.03      4.04
> 04:20:01 AM         0       820      3.57      4.62      4.64
> 04:30:01 AM         0       820     11.42      7.35      5.44
> 04:40:01 AM         0       820      4.20      7.51      7.10
> 04:50:01 AM         0       820      1.69      2.18      4.33
> 05:00:01 AM         0       820    513.68    406.40    205.61
> 05:10:01 AM         0       820    530.02    513.44    360.00
> 05:20:01 AM         0       820    530.06    527.83    440.93
> 05:30:01 AM         0       820    530.12    529.75    483.33
> 05:40:01 AM         0       820    530.07    530.04    505.57
> 05:50:01 AM         0       820    530.08    530.05    517.21
> 06:00:01 AM         0       820    530.02    530.03    523.29
> [...]
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
