[Linux-cluster] dlm: message size from 3 too big
lm chen
chen1537 at gmail.com
Fri May 21 03:09:18 UTC 2010
hi,
I'm stuck on gfs2 at the moment, so I don't have much time for the gfs code.
It seems to be caused by a network break? The messages pile up to more
than dlm_config.buffer_size:
    if (msglen > dlm_config.buffer_size) {
        printk("dlm: message size from %d too big %d(pkt len=%d)\n",
               nodeid, msglen, len);
        khexdump((const unsigned char *) msg, len);
        break;
    }
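For what it's worth, the numbers in the quoted log can be read straight out of the first 16 bytes of each hex dump. This is only a sketch of that arithmetic, not the dlm code: the offsets 6 and 8 are what I observe in the dumps rather than something taken from a header definition, get_le16(), get_le32() and length_too_big() are hypothetical helpers, and 4096 for the buffer size is a guess based on "limit=00001000" in the midcomms line.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for dlm_config.buffer_size. 4096 is a guess taken from
 * "limit=00001000" in the quoted midcomms line, not from the dlm source. */
#define ASSUMED_BUFFER_SIZE 4096u

/* Offsets observed in the quoted dumps, not taken from a header definition. */
#define OBSERVED_LEN_OFFSET  6
#define OBSERVED_LKID_OFFSET 8

/* First 16 bytes of the dump that follows "message size from 3 too big". */
static const unsigned char dump_too_big[16] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x87,
    0x00, 0x00, 0x00, 0x23, 0x00, 0x00, 0x00, 0x00
};

/* First 16 bytes of the dump that follows "bad header version 0". */
static const unsigned char dump_bad_header[16] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04,
    0x00, 0x00, 0x00, 0x66, 0x00, 0x00, 0x00, 0x00
};

/* Read an unsigned 16-bit little-endian value starting at p. */
static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read an unsigned 32-bit little-endian value starting at p. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the length field exceeds the assumed buffer size,
 * 0 if it fits, -1 if the packet is too short to hold the field. */
static int length_too_big(const unsigned char *msg, size_t len)
{
    if (len < OBSERVED_LEN_OFFSET + 2)
        return -1;
    return get_le16(msg + OBSERVED_LEN_OFFSET) > ASSUMED_BUFFER_SIZE;
}
```

Bytes 00 87 decode to 0x8700 = 34560, the "too big" size in the log, and in the second dump 00 04 gives length=1024 and 00 00 00 66 gives lkid=1711276032. So the receiver looks like it is parsing a length field out of data that is not really a message header.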
If someone is interested, take a look at the lowcomms code to see how its
flow control works on the sending peer (taking dlm_config.buffer_size into
account).

The "#48" error in your log is printed by rgmanager here:
/**
 * Check status of a cluster service
 *
 * @param svcName    Service name to check.
 * @return           FORWARD, FAIL, 0
 */
int
svc_status(char *svcName)
{
    void *lockp = NULL;
    rg_state_t svcStatus;

    if (rg_lock(svcName, &lockp) < 0) {
        clulog(LOG_ERR, "#48: Unable to obtain cluster lock: %s\n",
               strerror(errno));
        return FAIL;
    }
    [...]
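The "Connection timed out" part of the quoted clurgmgrd lines is just strerror(errno) from this call. On glibc that is the text for ETIMEDOUT, so presumably rg_lock() gave up with ETIMEDOUT while waiting on the DLM. A minimal sketch of how the message is built (format_lock_error() is a hypothetical helper for illustration, not rgmanager code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of rgmanager: builds the text that the
 * clulog() call above would log for a given errno value.
 * Returns the snprintf() result (length of the full message). */
static int format_lock_error(int err, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "#48: Unable to obtain cluster lock: %s",
                    strerror(err));
}
```

Calling format_lock_error(ETIMEDOUT, buf, sizeof buf) fills buf with the "#48" prefix followed by the platform's description of ETIMEDOUT.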
2010/5/21 James Chamberlain <jamesc at exa.com>
> Hi all,
>
> I've got a three node cluster running CentOS 4.8, GFS-6.1.19-1.el4_8 (GFS 1
> filesystems), kernel 2.6.9-89.0.19.ELsmp. I've seen messages like those
> below a couple times in the last couple weeks. Node 3 doesn't go down, so
> it doesn't get fenced; but DLM is unable to negotiate locks, so the load
> average on each node spikes and the cluster can't serve anything out through
> NFS. Has anyone seen anything like this? Any idea what to do about it?
> Shooting node 3 in the head has caused the cluster to recover, but I'd like
> to know how to fix it rather than work around it.
>
> Thanks,
>
> James
>
> [[Operating normally prior to this point]]
> May 20 04:52:50 s12n01 clurgmgrd[7467]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 04:53:41 s12n03 clurgmgrd[7476]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 04:54:09 s12n03 kernel: dlm: message size from 3 too big 34560(pkt len=386)
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 87-00 00 00 23 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 30 b0 d2 84 02 01 00 00-30 b0 d2 84 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 8e 64 13 80 ff ff ff ff-f0 7d ee 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: f0 7d ee 81 02 01 00 00-02 02 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: a4 de eb 81 02 01 00 00-00 40 01 00 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b8 7c ee 81 02 01 00 00-ff ff ff ff ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 82 01 00 00 00 00 00 00-90 de eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 10 00 00 00 00 00 00-00 00 00 00 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b7 6d db b6 6d db b6 6d-be cb 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 60 fa 06 01 00 01 00 00-82 01 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 82 d1 0d 2a 00 01 00 00-7e 0e 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 d0 0d 2a 00 01 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 03 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 68 7e ee 81 02 01 00 00-02 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-90 de eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 00 00 00 00 00 00-10 50 38 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: fc ff ff ff 00 00 00 00-98 3c eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: b0 c9 14 80 ff ff ff ff-b2 d1 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-91 d0 36 a0 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: a8 3c eb 81 02 01 00 00-87 c9 14 80 ff ff ff
> ff
> May 20 04:54:09 s12n03 kernel: ff ff ff ff ff ff ff ff-98 3c eb 81 02 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 30 3c eb 81 02 01 00 00-c0 86 f2 af 00 01 00
> 00
> May 20 04:54:09 s12n03 kernel: 12
> May 20 04:54:09 s12n03 kernel: 02
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: bad header version 0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: cmd=0, flags=0, length=1024, lkid=1711276032, lockspace=0
> May 20 04:54:09 s12n03 kernel: dlm: midcomms: base=000001002a0dd000, offset=1024, len=810, ret=1024, limit=00001000 newbuf=0
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 04-00 00 00 66 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01
> 00
> May 20 04:54:09 s12n03 kernel: 03 00 72 00 c0 00 a4 23-17 00 00 01 6a 01 c9
> 26
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 ff 03 01 16-19 70 00 00 ff 52 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 a6 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 01
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 03 00 72 00 3d 00-7a 2a 17 00 00 01 13
> 00
> May 20 04:54:09 s12n03 kernel: 42 2c 00 00 00 00 08 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 84
> 34
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 ff 03-01 16 19 70 00 00 8c
> 0a
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 37 00 00 00
> 53
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 01 00 01 00 03 00 72 00-49 03 b5 26 17 00 00
> 01
> May 20 04:54:09 s12n03 kernel: 56 01 69 26 00 00 00 00-08 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 84 34 00 00 00 00 00 00-ff 03 01 16 19 70 00
> 00
> May 20 04:54:09 s12n03 kernel: dc fc 00 00 00 00 00 00-00 00 00 00 00 0f 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 6d 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 01 00 01 00 03 00-72 00 75 00 b7 26 17
> 00
> May 20 04:54:09 s12n03 kernel: 00 01 93 02 4f 26 00 00-00 00 08 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 84 34 00 00 00 00-00 00 ff 03 01 16 19
> 70
> May 20 04:54:09 s12n03 kernel: 00 00 62 a7 00 00 00 01-00 00 00 00 00 00 00
> 50
> May 20 04:54:09 s12n03 kernel: 00 00 00 76 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 01 00 01 00-03 00 72 00 8e 02 2d
> 26
> May 20 04:54:09 s12n03 kernel: 17 00 00 01 81 03 85 27-00 00 00 00 08 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 84 34 00 00-00 00 00 00 ff 03 01
> 16
> May 20 04:54:09 s12n03 kernel: 19 70 00 00 6c a4 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 44 00 00 00 49 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 01 00-01 00 03 00 72 00 5b
> 00
> May 20 04:54:09 s12n03 kernel: f0 21 17 00 00 01 3a 02-fb 2b 00 00 00 00 08
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 84 34-00 00 00 00 00 00 ff
> 03
> May 20 04:54:09 s12n03 kernel: 01 16 19 70 00 00 ff 74-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 83-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-01 00 01 00 03 00 72
> 00
> May 20 04:54:10 s12n03 kernel: 9a 02 18 23 17 00 00 01-b9 02 f5 2d 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 08 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-84 34 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16 19 70 00 00-ff 83 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 75 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 01 00 01 00 03
> 00
> May 20 04:54:10 s12n03 kernel: 72 00 1b 01 86 2a 17 00-00 01 56 02 d8 28 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 08 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 84 34 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 ff 03 01 16 19 70-00 00 fe 82 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 64 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01
> 00
> May 20 04:54:10 s12n03 kernel: 03 00 72 00 c6 01 fc 27-17 00 00 01 e0 00 f6
> 28
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00
> 00
> May 20 04:54:10 s12n03 kernel: 00 00 00 00
> May 20 04:54:10 s12n03 kernel: ff 03 01 16
> May 20 04:54:10 s12n03 kernel: 19 70 00 00
> May 20 04:54:10 s12n03 kernel: f7
> May 20 04:54:10 s12n03 kernel: a3
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: 00
> May 20 04:54:10 s12n03 kernel: dlm: lowcomms: addr=000001002a0dd000, base=0, len=1834, iov_len=3710, iov_base[0]=000001002a0dd72a, read=1448
> May 20 04:54:50 s12n01 clurgmgrd[7467]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 04:56:41 s12n03 clurgmgrd[7476]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 05:02:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May 20 05:05:13 s12n02 clurgmgrd[7527]: <err> #50: Unable to obtain cluster lock: Connection timed out
> May 20 05:08:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
> [...]
>
> When I say the load spikes, this is what I mean:
>
> Linux 2.6.9-89.0.19.ELsmp (s12n01) 05/20/2010
>
> 12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
> [...]
> 04:00:01 AM         0       820      0.18      1.84      3.66
> 04:10:01 AM         0       820      2.72      4.03      4.04
> 04:20:01 AM         0       820      3.57      4.62      4.64
> 04:30:01 AM         0       820     11.42      7.35      5.44
> 04:40:01 AM         0       820      4.20      7.51      7.10
> 04:50:01 AM         0       820      1.69      2.18      4.33
> 05:00:01 AM         0       820    513.68    406.40    205.61
> 05:10:01 AM         0       820    530.02    513.44    360.00
> 05:20:01 AM         0       820    530.06    527.83    440.93
> 05:30:01 AM         0       820    530.12    529.75    483.33
> 05:40:01 AM         0       820    530.07    530.04    505.57
> 05:50:01 AM         0       820    530.08    530.05    517.21
> 06:00:01 AM         0       820    530.02    530.03    523.29
> [...]
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>