[Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved)

Mon Sep 12 17:18:25 UTC 2011

On 9/12/2011 6:43 PM, Florian Haas wrote:
> On 2011-09-12 17:52, Roth, Steven (ESS Software) wrote:
>> I have attempted to debug the kernel panic that I reported on this list
>> last week, which has been reported by several others as well.  The panic
>> happens when DRBD is used in clusters based on corosync (either RHCS or
>> Pacemaker), but only when those clusters are configured with multiple
>> heartbeats (i.e., with “altname” specifications for the cluster nodes). 
>> The panic appears to be caused by two defects, one in the distributed
>> lock manager (DLM, used by corosync) and one in the SCTP network
>> protocol (which is used in clusters with multiple heartbeats).  DRBD
>> code triggers the panic but appears to be blameless for it.
>>
>> Disclaimer:  I am not a Linux kernel expert; all of my kernel debugging
>> expertise is on a different flavor of Unix.  My assumptions or
>> conclusions may be incorrect; I do not guarantee 100% accuracy of this
>> analysis.  Caveat lector.
>>
>> Environment:  As will be clear from the analysis below, this defect can
>> manifest in many ways.  I debugged a particular manifestation that
>> occurred with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64
>> (i.e., RHEL/CentOS 6.0).  The manifestation I debugged was running a two
>> node cluster, shutting down node A and starting it back up.  Node B
>> panics as soon as Node A starts back up.  (See my previous mail for the
>> defect signature.)
>>
>> When the cluster starts up, it creates a DLM “lockspace”.  This causes
>> the DLM code to create a socket for communication with the other nodes. 
>> Since we’re configured for multiple heartbeats, it’s an SCTP socket. 
>> DLM also creates a bunch of new kernel threads, among which is the
>> dlm_recv thread, which listens for traffic on that socket.  (Actually I
>> see two of them, one per CPU.)  You can see this in a “ps” listing.
>>
>> An important thing to note here is that all kernel threads are part of
>> the same pseudo-process, and as such, they all share the same set of
>> file descriptors.  However, kernel threads do not normally (ever?) use
>> file descriptors; they tend to work with file structures directly.  The
>> SCTP socket created above, for example, has the appropriate in-kernel
>> socket structure, file structure, and inode structure, but it does not
>> have a file descriptor.  That’s as it should be.
>>
>> When node A starts back up, the SCTP protocol notices this (as it’s
>> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
>> notification to the SCTP socket, telling the socket owner (the dlm_recv
>> thread) that the other node has restarted.  DLM responds by telling SCTP
>> to create a clone of the master socket, for use in communicating with
>> the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.) 
>> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
>> designed to be called from user space, not from a kernel thread, and so
>> it /does/ allocate a file descriptor for the new socket.  Since DLM is
>> calling it from a kernel thread, the kernel thread now has an open file
>> descriptor (#0) to that socket.  And since kernel threads share the same
>> file descriptor table, every kernel thread on the system has this open
>> descriptor.  So defect #1 is that DLM is calling an SCTP user-space
>> interface from a kernel thread, which results in pollution of the kernel
>> thread file descriptor table.
>>
>> Meanwhile, DRBD has its own in-kernel code, running in a different
>> kernel thread.  And it detects (I didn’t bother to track down how) that
>> its peer is back online.  DRBD allows the user to configure handlers for
>> events like that: user space programs that should be called when such an
>> event occurs.  So when DRBD notices that its peer is back, its kernel
>> thread uses call_usermodehelper() to start a user-space instance of drbdadm
>> to invoke any appropriate handlers.  This is the invocation of drbdadm
>> that we see in the panic report.  (drbdadm gets invoked this way in
>> response to a number of other possible events, as well, so this panic
>> can manifest itself in other ways.)
>>
>> The key thing about this instance of drbdadm is that it was invoked by a
>> kernel thread.  Therefore it shouldn’t have any open file descriptors —
>> but in this case, it does: it inherits fd 0 pointing to the SCTP
>> socket.  One of the first things that drbdadm does, when starting up, is
>> call isatty(stdin) to find out how it should format its output.  If it
>> were called from user space, that would correctly check whether standard
>> input was interactive.  If it were called correctly from a kernel
>> thread, there would be no stdin and it would correctly return an error. 
>> But what actually happens is that it calls isatty on the SCTP socket
>> that is (incorrectly) in file descriptor 0.
>>
>> isatty() is implemented as a terminal ioctl on the descriptor.  When
>> ioctl is called on a socket, the sock_ioctl() function dereferences
>> the socket data structure pointer (sk).  Defect #2 is that the offending
>> socket in this case has a null sk pointer.  (I did not track down why,
>> but presumably it’s a problem with the SCTP peel-off code.)  So when
>> sock_ioctl() dereferences the pointer, the kernel panics.
>>
>> So, to recap:  this panic occurs because (a) the drbdadm process is
>> erroneously given an SCTP socket as its standard input, and (b) that
>> socket’s data pointer is null, so it panics when drbdadm (reasonably)
>> makes an ioctl call on its standard input.
>>
>> If you need a workaround for this panic, the best I can offer is to
>> remove the “altname” specifications from the cluster configuration, set
>> <totem rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses
>> TCP sockets instead of SCTP sockets.
> 
> Wow. Pretty awesome analysis. This is something that the openais
> (Corosync) mailing list should know about, but it currently seems
> affected by the LF breach/outage. Thus, I'm CC'ing two key people here
> directly. Fabio and Steve, could you take a look into this please?
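
The descriptor inheritance described in the quoted analysis can be illustrated from user space. The following is only a minimal Python sketch: a connected socket pair stands in for the peeled-off SCTP socket, a child interpreter stands in for drbdadm, and the function names are hypothetical. It shows a child process that finds a socket on fd 0 and asks whether it is a terminal. It does not reproduce the panic, which additionally needs the null sk pointer in the kernel.

```python
import socket
import subprocess
import sys


def child_sees_tty_on_stdin(stdin_file) -> bool:
    """Start a child whose fd 0 is stdin_file and return its isatty(0) result."""
    # The child plays the role of the helper program: it inspects fd 0 on startup.
    probe = "import os; print(os.isatty(0))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        stdin=stdin_file,      # the child inherits this descriptor as fd 0
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


def socket_stdin_is_tty() -> bool:
    """Hand a socket to a child as its standard input and report what it sees."""
    # Assumption: any socket will do for the illustration; SCTP is not required.
    left, right = socket.socketpair()
    try:
        return child_sees_tty_on_stdin(left)
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print("child saw a tty on stdin:", socket_stdin_is_tty())
```

On a healthy kernel the terminal check on a socket simply fails and the child carries on; in the reported case the same check reaches sock_ioctl() with a null sk pointer.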

I am CC'ing David and cluster-devel. David maintains DLM in both the kernel and userland.

A few quick notes about using RRP/altname in a more general fashion.

RRP/altname is expected to be in Technology Preview state starting with
RHEL6.2 (the technology will be there for users to test/try, but it is not
officially supported for production yet). We have not done a lot of
intensive testing on the overall RHCS stack yet (except corosync, which
btw does not use DLM), so there might be (== there are) bugs that we will
have to address. Packages in RHEL6.2/CentOS6.2 will have reasonable
defaults and are expected to work better than those in RHEL6.0/CentOS6.0
(but will still be far from bug free).

It will take some time before the whole stack is fully
tested/supported in such an environment, but it is a work in progress.

This report is extremely useful and surely will speed up things a lot.

Thanks
Fabio