[Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved)
Fabio M. Di Nitto
fdinitto at redhat.com
Mon Sep 12 17:18:25 UTC 2011
On 9/12/2011 6:43 PM, Florian Haas wrote:
> On 2011-09-12 17:52, Roth, Steven (ESS Software) wrote:
>> I have attempted to debug the kernel panic that I reported on this list
>> last week, which has been reported by several others as well. The panic
>> happens when DRBD is used in clusters based on corosync (either RHCS or
>> Pacemaker), but only when those clusters are configured with multiple
>> heartbeats (i.e., with “altname” specifications for the cluster nodes).
>> The panic appears to be caused by two defects, one in the distributed
>> lock manager (DLM, used by corosync) and one in the SCTP network
>> protocol (which is used in clusters with multiple heartbeats). DRBD
>> code triggers the panic but appears to be blameless for it.
>> Disclaimer: I am not a Linux kernel expert; all of my kernel debugging
>> expertise is on a different flavor of Unix. My assumptions or
>> conclusions may be incorrect; I do not guarantee 100% accuracy of this
>> analysis. Caveat lector.
>> Environment: As will be clear from the analysis below, this defect can
>> manifest in many ways. I debugged a particular manifestation that
>> occurred with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64
>> (i.e., RHEL/CentOS 6.0). The manifestation I debugged was running a two
>> node cluster, shutting down node A and starting it back up. Node B
>> panics as soon as Node A starts back up. (See my previous mail for the
>> defect signature.)
>> When the cluster starts up, it creates a DLM “lockspace”. This causes
>> the DLM code to create a socket for communication with the other nodes.
>> Since we’re configured for multiple heartbeats, it’s an SCTP socket.
>> DLM also creates a bunch of new kernel threads, among which is the
>> dlm_recv thread, which listens for traffic on that socket. (Actually I
>> see two of them, one per CPU.) You can see this in a “ps” listing.
>> An important thing to note here is that all kernel threads are part of
>> the same pseudo-process, and as such, they all share the same set of
>> file descriptors. However, kernel threads do not normally (ever?) use
>> file descriptors; they tend to work with file structures directly. The
>> SCTP socket created above, for example, has the appropriate in-kernel
>> socket structure, file structure, and inode structure, but it does not
>> have a file descriptor. That’s as it should be.
>> When node A starts back up, the SCTP protocol notices this (as it’s
>> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
>> notification to the SCTP socket, telling the socket owner (the dlm_recv
>> thread) that the other node has restarted. DLM responds by telling SCTP
>> to create a clone of the master socket, for use in communicating with
>> the newly restarted node. (This is an SCTP_SOCKOPT_PEELOFF request.)
>> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
>> designed to be called from user space, not from a kernel thread, and so
>> it /does/ allocate a file descriptor for the new socket. Since DLM is
>> calling it from a kernel thread, the kernel thread now has an open file
>> descriptor (#0) to that socket. And since kernel threads share the same
>> file descriptor table, every kernel thread on the system has this open
>> descriptor. So defect #1 is that DLM is calling an SCTP user-space
>> interface from a kernel thread, which results in pollution of the kernel
>> thread file descriptor table.
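
[Illustrative aside, not part of the original analysis.] The reason the leaked descriptor ends up as #0 can be reproduced from user space: descriptors are handed out lowest-free-first, so a task with an empty descriptor table gets 0 for the first one it allocates. The sketch below is a user-space analogy only. It uses an AF_UNIX socket rather than an SCTP peel-off, and the function name is hypothetical.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: frees descriptor 0, creates a socket, and returns
 * the descriptor number the kernel handed out for it. The original stdin
 * (if there was one) is restored before returning. Returns -1 if the
 * socket could not be created.
 */
int fd_for_new_socket_when_slot0_is_free(void)
{
    /* Park the current stdin, if any, on a descriptor >= 10 for later restore. */
    int saved = fcntl(STDIN_FILENO, F_DUPFD, 10);

    /* Empty slot 0; the error is ignored if it was already empty. */
    close(STDIN_FILENO);

    /* The new socket takes the lowest free descriptor number. */
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    int result = fd;
    if (fd >= 0)
        close(fd);

    /* Put the original stdin back where it was. */
    if (saved >= 0) {
        dup2(saved, STDIN_FILENO);
        close(saved);
    }
    return result;
}
```

Called from a small main(), the helper is expected to report 0, which matches the "#0" seen in the kernel-thread case described above.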
>> Meanwhile, DRBD has its own in-kernel code, running in a different
>> kernel thread. And it detects (I didn’t bother to track down how) that
>> its peer is back online. DRBD allows the user to configure handlers for
>> events like that: user space programs that should be called when such an
>> event occurs. So when DRBD notices that its peer is back, its kernel
>> thread uses call_usermodehelper() to start a user-space instance of drbdadm
>> to invoke any appropriate handlers. This is the invocation of drbdadm
>> that we see in the panic report. (drbdadm gets invoked this way in
>> response to a number of other possible events, as well, so this panic
>> can manifest itself in other ways.)
>> The key thing about this instance of drbdadm is that it was invoked by a
>> kernel thread. Therefore it shouldn’t have any open file descriptors —
>> but in this case, it does: it inherits fd 0 pointing to the SCTP
>> socket. One of the first things that drbdadm does, when starting up, is
>> call isatty(stdin) to find out how it should format its output. If it
>> were called from user space, that would correctly check whether standard
>> input was interactive. If it were called correctly from a kernel
>> thread, there would be no stdin and it would correctly return an error.
>> But what actually happens is that it calls isatty on the SCTP socket
>> that is (incorrectly) in file descriptor 0.
>> When ioctl is called on a socket, the sock_ioctl() function dereferences
>> the socket data structure pointer (sk). Defect #2 is that the offending
>> socket in this case has a null sk pointer. (I did not track down why,
>> but presumably it’s a problem with the SCTP peel-off code.) So when
>> sock_ioctl() dereferences the pointer, the kernel panics.
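
[Illustrative aside, not part of the original analysis.] The link between isatty() and sock_ioctl() is that glibc on Linux implements isatty() with a terminal ioctl on the descriptor. On a healthy socket that ioctl simply fails and isatty() returns 0; the panic needs the null sk pointer described above. A minimal user-space sketch of the healthy case follows. It assumes an AF_UNIX socket as a stand-in, and the function name is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: calls isatty() on a freshly created socket.
 * Returns isatty()'s result (0 means "not a terminal") and stores the
 * resulting errno in *err. Returns -1 if the socket could not be created.
 */
int isatty_on_socket(int *err)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        *err = errno;
        return -1;
    }

    errno = 0;
    int r = isatty(fd);   /* issues a terminal ioctl on the socket */
    *err = errno;         /* typically ENOTTY on Linux */

    close(fd);
    return r;
}
```

On a working kernel this returns 0 with errno set, which is the harmless outcome drbdadm would see if the socket in descriptor 0 had a valid sk pointer.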
>> So, to recap: this panic occurs because (a) the drbdadm process is
>> erroneously given an SCTP socket as its standard input, and (b) that
>> socket’s data pointer is null, so it panics when drbdadm (reasonably)
>> makes an ioctl call on its standard input.
>> If you need a workaround for this panic, the best I can offer is to
>> remove the “altname” specifications from the cluster configuration, set
>> <totem rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses
>> TCP sockets instead of SCTP sockets.
> Wow. Pretty awesome analysis. This is something that the openais
> (Corosync) mailing list should know about, but it currently seems
> affected by the LF breach/outage. Thus, I'm CC'ing two key people here
> directly. Fabio and Steve, could you take a look into this please?
I am CC'ing David and cluster-devel. David maintains DLM in kernel and userland.
A few quick notes about using RRP/altname in a more general fashion.
RRP/altname is expected to be in Technology Preview state starting from
RHEL6.2 (the technology will be there for users to test/try but not
officially supported for production yet). We have not done a lot of
intensive testing on the overall RHCS stack yet (except corosync, that
btw does not use DLM) so there might be (== there are) bugs that we will
have to address. Packages in RHEL6.2/Centos6.2 will have reasonable
defaults and they are expected to work better (but far from being bug
free) vs RHEL6.0/Centos6.0.
It will take some time before the whole stack is fully
tested/supported in such an environment, but it is a work in progress.
This report is extremely useful and surely will speed up things a lot.