[Linux-cluster] dlm: message size from 3 too big

lm chen chen1537 at gmail.com
Fri May 21 03:09:18 UTC 2010


hi,

   I'm tied up with gfs2, so I haven't had much time for the gfs code.

   It seems this may be caused by a network break, with the messages piling
up to more than dlm_config.buffer_size, which trips this check:


                if (msglen > dlm_config.buffer_size) {
                        printk("dlm: message size from %d too big %d(pkt len=%d)\n",
                               nodeid, msglen, len);
                        khexdump((const unsigned char *) msg, len);
                        break;
                }
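
To make the check above easier to reason about, here is a rough user-space
sketch of a receiver walking a buffer of back-to-back, length-prefixed
messages. It is not the kernel code. Every name in it (sketch_walk,
sketch_msg_len, SKETCH_BUFFER_SIZE, SKETCH_HDR_LEN) is hypothetical. The
4096-byte limit is only a guess based on "limit=00001000" in the quoted log,
and the header layout (cmd, flags, 16-bit little-endian length, lkid,
lockspace) is assumed from the fields the midcomms line prints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the kernel's definitions. */
#define SKETCH_BUFFER_SIZE 4096u /* assumed value of dlm_config.buffer_size */
#define SKETCH_HDR_LEN     12u   /* assumed: cmd(1) flags(1) length(2) lkid(4) lockspace(4) */

enum walk_result { WALK_OK = 0, WALK_TOO_BIG = -1, WALK_TOO_SMALL = -2 };

/* Read the 16-bit little-endian length field at offset 2 of a header. */
static uint16_t sketch_msg_len(const unsigned char *hdr)
{
        return (uint16_t)(hdr[2] | (hdr[3] << 8));
}

/*
 * Walk the complete messages in buf[0..len).
 * On return, *consumed holds the number of bytes covered by complete,
 * plausible messages. A trailing partial message is left for the next read.
 * Returns WALK_OK, or an error when a header carries an impossible length.
 */
static int sketch_walk(const unsigned char *buf, size_t len, size_t *consumed)
{
        size_t off = 0;

        assert(buf != NULL && consumed != NULL);
        *consumed = 0;

        while (len - off >= SKETCH_HDR_LEN) {
                size_t msglen = sketch_msg_len(buf + off);

                /* Same idea as the "too big" test above. */
                if (msglen > SKETCH_BUFFER_SIZE) {
                        *consumed = off;
                        return WALK_TOO_BIG;
                }
                /* A message can never be shorter than its own header. */
                if (msglen < SKETCH_HDR_LEN) {
                        *consumed = off;
                        return WALK_TOO_SMALL;
                }
                /* Partial message: stop and wait for more data. */
                if (msglen > len - off)
                        break;

                off += msglen;
        }

        *consumed = off;
        return WALK_OK;
}
```

If a 16-byte message is followed by a header whose length bytes are 00 87,
sketch_walk() returns WALK_TOO_BIG with *consumed set to 16, because 0x8700
is 34560. Note that a check like this cannot tell a genuinely oversized
message from a receiver that has lost the message boundary and is reading a
length out of the middle of some other data.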

It would help if someone who is interested could look at the lowcomms code
to see how its flow control works on the sending peer (taking
dlm_config.buffer_size into account).


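For reference, this is where the "#48: Unable to obtain cluster lock" error
in your log is raised (excerpt):

/**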
 * Check status of a cluster service
 *
 * @param svcName       Service name to check.
 * @return              FORWARD, FAIL, 0
 */
int
svc_status(char *svcName)
{
        void *lockp = NULL;
        rg_state_t svcStatus;

        if (rg_lock(svcName, &lockp) < 0) {
                clulog(LOG_ERR, "#48: Unable to obtain cluster lock: %s\n",
                       strerror(errno));
                return FAIL;
        }



2010/5/21 James Chamberlain <jamesc at exa.com>

> Hi all,
>
> I've got a three node cluster running CentOS 4.8, GFS-6.1.19-1.el4_8 (GFS 1
> filesystems), kernel 2.6.9-89.0.19.ELsmp.  I've seen messages like those
> below a couple times in the last couple weeks.  Node 3 doesn't go down, so
> it doesn't get fenced; but DLM is unable to negotiate locks, so the load
> average on each node spikes and the cluster can't serve anything out through
> NFS.  Has anyone seen anything like this? Any idea what to do about it?
>  Shooting node 3 in the head has caused the cluster to recover, but I'd like
> to know how to fix it rather than work around it.
>
> Thanks,
>
> James
>
> [[Operating normally prior to this point]]
> May 20 04:52:50 s12n01 clurgmgrd[7467]: <err> #48: Unable to obtain cluster
> lock: Connection timed out
> May 20 04:53:41 s12n03 clurgmgrd[7476]: <err> #48: Unable to obtain cluster
> lock: Connection timed out
> May 20 04:54:09 s12n03 kernel: dlm: message size from 3 too big 34560(pkt
> len=386)
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 87-00 00 00 23 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 30 b0 d2 84 02 01 00 00-30 b0 d2 84 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 8e 64 13 80 ff ff ff ff-f0 7d ee 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: f0 7d ee 81 02 01 00 00-02 02 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: a4 de eb 81 02 01 00 00-00 40 01 00 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b8 7c ee 81 02 01 00 00-ff ff ff ff ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 82 01 00 00 00 00 00 00-90 de eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 10 00 00 00 00 00 00-00 00 00 00 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b7 6d db b6 6d db b6 6d-be cb 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 60 fa 06 01 00 01 00 00-82 01 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 82 d1 0d 2a 00 01 00 00-7e 0e 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 d0 0d 2a 00 01 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 03 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 68 7e ee 81 02 01 00 00-02 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-90 de eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 00 00 00 00 00 00-10 50 38 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: fc ff ff ff 00 00 00 00-98 3c eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b0 c9 14 80 ff ff ff ff-b2 d1 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-91 d0 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: a8 3c eb 81 02 01 00 00-87 c9 14 80 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: ff ff ff ff ff ff ff ff-98 3c eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 30 3c eb 81 02 01 00 00-c0 86 f2 af 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 12
> May 20 04:54:09 s12n03 kernel: 02
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: bad header version 0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: cmd=0, flags=0, length=1024,
> lkid=1711276032, lockspace=0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: base=000001002a0dd000,
> offset=1024, len=810, ret=1024, limit=00001000 newbuf=0
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 04-00 00 00 66 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01
> 00
> May 20 04:54:09 s12n03 kernel: 03 00 72 00 c0 00 a4 23-17 00 00 01 6a 01 c9
> 26
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 ff 03 01 16-19 70 00 00 ff 52 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 a6 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 01
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 03 00 72 00 3d 00-7a 2a 17 00 00 01 13
> 00
> May 20 04:54:09 s12n03 kernel: 42 2c 00 00 00 00 08 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 84
> 34
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 ff 03-01 16 19 70 00 00 8c
> 0a
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 37 00 00 00
> 53
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 01 00 03 00 72 00-49 03 b5 26 17 00 00
> 01
> May 20 04:54:09 s12n03 kernel: 56 01 69 26 00 00 00 00-08 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 84 34 00 00 00 00 00 00-ff 03 01 16 19 70 00
> 00
> May 20 04:54:09 s12n03 kernel: dc fc 00 00 00 00 00 00-00 00 00 00 00 0f 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 6d 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 01 00 01 00 03 00-72 00 75 00 b7 26 17
> 00
> May 20 04:54:09 s12n03 kernel: 00 01 93 02 4f 26 00 00-00 00 08 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 84 34 00 00 00 00-00 00 ff 03 01 16 19
> 70
> May 20 04:54:09 s12n03 kernel: 00 00 62 a7 00 00 00 01-00 00 00 00 00 00 00
> 50
> May 20 04:54:09 s12n03 kernel: 00 00 00 76 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 01 00 01 00-03 00 72 00 8e 02 2d
> 26
> May 20 04:54:09 s12n03 kernel: 17 00 00 01 81 03 85 27-00 00 00 00 08 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 84 34 00 00-00 00 00 00 ff 03 01
> 16
> May 20 04:54:09 s12n03 kernel: 19 70 00 00 6c a4 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 44 00 00 00 49 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 01 00-01 00 03 00 72 00 5b
> 00
> May 20 04:54:09 s12n03 kernel: f0 21 17 00 00 01 3a 02-fb 2b 00 00 00 00 08
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 84 34-00 00 00 00 00 00 ff
> 03
> May 20 04:54:09 s12n03 kernel: 01 16 19 70 00 00 ff 74-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 83-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-01 00 01 00 03 00 72
> 00
> May 20 04:54:10 s12n03 kernel: 9a 02 18 23 17 00 00 01-b9 02 f5 2d 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 08 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-84 34 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16 19 70 00 00-ff 83 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 75 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 01 00 01 00 03
> 00
> May 20 04:54:10 s12n03 kernel: 72 00 1b 01 86 2a 17 00-00 01 56 02 d8 28 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 08 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 84 34 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 ff 03 01 16 19 70-00 00 fe 82 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 64 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01
> 00
> May 20 04:54:10 s12n03 kernel: 03 00 72 00 c6 01 fc 27-17 00 00 01 e0 00 f6
> 28
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16
> May 20 04:54:10 s12n03 kernel: 19 70 00 00
> May 20 04:54:10 s12n03 kernel: f7
> May 20 04:54:10 s12n03 kernel: a3
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: dlm: lowcomms: addr=000001002a0dd000,
> base=0, len=1834, iov_len=3710, iov_base[0]=000001002a0dd72a, read=1448
> May 20 04:54:50 s12n01 clurgmgrd[7467]: <err> #50: Unable to obtain cluster
> lock: Connection timed out
> May 20 04:56:41 s12n03 clurgmgrd[7476]: <err> #50: Unable to obtain cluster
> lock: Connection timed out
> May 20 05:02:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster
> lock: Connection timed out
> May 20 05:05:13 s12n02 clurgmgrd[7527]: <err> #50: Unable to obtain cluster
> lock: Connection timed out
> May 20 05:08:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster
> lock: Connection timed out
> [...]
>
> When I say the load spikes, this is what I mean:
>
> Linux 2.6.9-89.0.19.ELsmp (s12n01)      05/20/2010
>
> 12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
> [...]
> 04:00:01 AM         0       820      0.18      1.84      3.66
> 04:10:01 AM         0       820      2.72      4.03      4.04
> 04:20:01 AM         0       820      3.57      4.62      4.64
> 04:30:01 AM         0       820     11.42      7.35      5.44
> 04:40:01 AM         0       820      4.20      7.51      7.10
> 04:50:01 AM         0       820      1.69      2.18      4.33
> 05:00:01 AM         0       820    513.68    406.40    205.61
> 05:10:01 AM         0       820    530.02    513.44    360.00
> 05:20:01 AM         0       820    530.06    527.83    440.93
> 05:30:01 AM         0       820    530.12    529.75    483.33
> 05:40:01 AM         0       820    530.07    530.04    505.57
> 05:50:01 AM         0       820    530.08    530.05    517.21
> 06:00:01 AM         0       820    530.02    530.03    523.29
> [...]