[Cluster-devel] Panic when stopping gulm.

Mathieu Avila mathieu.avila at seanodes.com
Mon Oct 16 14:07:38 UTC 2006


Hello, 


I occasionally get kernel panics when stopping gulm on my whole cluster.
They are infrequent. The panic occurs inside a function of the "ipv6"
module, called from one of the gulm kernel threads.


Process gulm_res_recvd (pid: 5029, threadinfo 0000010021300000, task 000001003f60f030)
Stack: 0000000000004034 0000000000000000 0000001e124dd670 000001001533d380
       0000010021301d08 0000000000000000 0000010021301e18 000001003c3a5e00
       00000100124dd670 0000000023222120
Call Trace:<ffffffffa01d7a58>{:ipv6:tcp_v6_xmit+611} <ffffffff80134dea>{autoremove_wake_function+0}
       <ffffffffa02c6af4>{:lock_gulm:do_tfer+252} <ffffffffa02c6bcb>{:lock_gulm:xdr_send+34}
       <ffffffffa02c5c53>{:lock_gulm:xdr_enc_flush+44} <ffffffffa02c5cc3>{:lock_gulm:xdr_enc_release+19}
       <ffffffffa02c383b>{:lock_gulm:lg_core_handle_messages+394}
       <ffffffffa02be1b7>{:lock_gulm:cm_io_recving_thread+73}
       <ffffffff80110e17>{child_rip+8} <ffffffffa02be16e>{:lock_gulm:cm_io_recving_thread+0}
       <ffffffff80110e0f>{child_rip+0}


I looked at the code in src/gulm/xdr_io.c, in the function "do_tfer",
and found something strange:

---------------------------------------------------
	for (;;) {
		m.msg_iov = iov;
		m.msg_iovlen = n;
		m.msg_flags = MSG_NOSIGNAL;

		if (dir)
			rv = sock_sendmsg (sock, &m, size - moved);
		else
			rv = sock_recvmsg (sock, &m, size - moved, 0);

		if (rv <= 0)
			goto out_err;
		moved += rv;

		if (moved >= size)
			break;

		/* adjust iov's for next transfer */
		while (iov->iov_len == 0) {
			iov++;
			n--;
		}
---------------------------------------------------

In my opinion, when "sock_sendmsg" returns less than the size that was
requested, we get into
		while (iov->iov_len == 0) {
			iov++;
			n--;
		}
even if we are already at the last buffer, because the loop never checks
"n", the number of buffers in the "iov" table. "sock_sendmsg" is then
called with an invalid buffer pointer (m.msg_iov = iov).
I don't know whether this is relevant, since "n" always equals "1"
wherever "do_tfer" is called.

Anyway, this couldn't happen if "n" were checked:
---------------------------
		while ((n > 1) && (iov->iov_len == 0)) {
			iov++;
			n--;
		}
		if (n<=1) break;
---------------------------
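
To make the bounds check concrete, here is a minimal user-space sketch
of the same idea. It is not gulm code: "iov_advance" is a hypothetical
helper, and unlike the kernel loop above it does the byte accounting
itself. The only point is that "n" is tested before "iov" is ever
dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of gulm): consume `moved` bytes from the
 * front of an iovec array and skip entries that are fully drained.
 * Returns the number of entries still remaining; *iovp may only be
 * dereferenced when the return value is > 0.
 */
static int iov_advance(struct iovec **iovp, int n, size_t moved)
{
	struct iovec *iov = *iovp;

	assert(iov != NULL && n >= 0);

	/* Test n first so we never read past the end of the array. */
	while (n > 0) {
		if (moved < iov->iov_len) {
			/* Partially consumed entry: trim it and stop. */
			iov->iov_base = (char *)iov->iov_base + moved;
			iov->iov_len -= moved;
			break;
		}
		/* Entry fully consumed (or empty): move to the next one. */
		moved -= iov->iov_len;
		iov->iov_len = 0;
		iov++;
		n--;
	}

	*iovp = iov;
	return n;
}
```

A caller would stop transferring as soon as the helper returns 0,
instead of handing a pointer past the end of the table to the next
send.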

Checking "n" still doesn't guarantee that the message is sent as a
whole. Maybe the solution is to use
		m.msg_flags = MSG_NOSIGNAL | MSG_WAITALL;
together with a loop over sock_sendmsg until the full message is sent.
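
As an illustration of the "loop until the full message is sent" idea,
here is a rough user-space analogue. It assumes a Linux system and a
connected stream socket, and uses the ordinary send() call rather than
the in-kernel sock_sendmsg; "send_all" is a hypothetical name, not
something that exists in gulm.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical user-space helper: keep calling send() until `len` bytes
 * have gone out or an error occurs. Returns 0 on success, -1 on error.
 */
static int send_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t rv = send(fd, p, len, MSG_NOSIGNAL);

		if (rv < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (rv == 0)
			return -1;	/* no progress; treat as an error */

		/* Short write: advance past what was sent and try again. */
		p += rv;
		len -= (size_t)rv;
	}
	return 0;
}
```

With a single buffer there is no table to walk, so the overrun cannot
occur; a short write only moves the pointer forward.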

Any ideas on this?

Thanks in advance,

--
Mathieu Avila




