capturing a core file from a non-privileged daemon

Tim Mooney Tim.Mooney at ndsu.edu
Thu Jul 15 20:42:27 UTC 2010


All-

I have a system where a daemon (started as root) periodically forks
non-privileged workers, and those workers sometimes try to dump core.
I would like to capture those core files for debugging, but so far I
have been unable to find a way for the daemon to actually generate the
core file.  I'm hoping someone can point out what I'm missing.

The system is RHEL 4.8, x86_64, currently running 2.6.9-89.0.23.ELsmp.
Running "dmesg", I see dozens of these per day:

# dmesg
lpd[12642]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[21006]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[16944]: segfault at 0000000036383675 rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[19501]: segfault at 0000000036383675 rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[11300]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
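
A tally along the lines of the sketch below makes the pattern in that
sample easier to see: every fault has the same rip, and only two
distinct fault addresses show up.  The function name
summarize_segfaults is purely illustrative, and the sketch assumes each
kernel message arrives on a single line, as dmesg itself prints it.

```shell
# Illustrative helper (not part of any package): tally lpd segfaults
# by fault address and instruction pointer.
# Reads kernel log lines on stdin, one message per line.
summarize_segfaults() {
    grep 'lpd\[[0-9]*]: segfault at' |
        sed 's/.*segfault at \([0-9a-f]*\) rip \([0-9a-f]*\).*/\1 \2/' |
        sort | uniq -c | sort -rn
}

# Usage: dmesg | summarize_segfaults
```

Each output line is a count followed by the fault address and the rip,
most frequent first.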


The daemon is a slightly older version of the LPRng lpd.  Its model is to
start as root but switch to a non-privileged user (lp) and then fork workers
as needed for queue processing.  The daemon is locally-compiled and is
not stripped.

It's being started with the following line in /etc/init.d/lpd:

 	daemon /usr/local/sbin/lpd

Because the daemon shell function defaults to setting "ulimit -c 0", I've
added the following two lines to the startup script, to override that
default behavior:

DAEMON_COREFILE_LIMIT=unlimited
export DAEMON_COREFILE_LIMIT
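
The effect of that variable can be sketched as follows, assuming the
daemon function applies it to the soft core limit with a ${VAR:-0}
style default.  This is a paraphrase rather than a verbatim copy of
/etc/init.d/functions, and apply_core_limit is a made-up name used only
for illustration.

```shell
# Sketch only: assumes daemon() sets the soft core limit from
# DAEMON_COREFILE_LIMIT and defaults to 0 when the variable is unset.
# apply_core_limit is a hypothetical name used for illustration.
apply_core_limit() {
    ulimit -S -c "${DAEMON_COREFILE_LIMIT:-0}" >/dev/null 2>&1
    ulimit -S -c    # print the soft limit now in effect
}

# Usage (in a subshell, so the caller's own limit is untouched):
#   ( unset DAEMON_COREFILE_LIMIT; apply_core_limit )
#   ( DAEMON_COREFILE_LIMIT=unlimited; apply_core_limit )
```

If the sketch is accurate, only the soft limit is changed, so the hard
limit inherited from the calling environment still applies.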

If I check the /proc/<pid>/limits file for either the master lpd process
or any of the worker processes, I can see that the core file limit is
"unlimited":

$ ps -ef | grep -i lpd
lp       16689     1  1 Jul11 ?        01:17:39 lpd Waiting 
lp       16690 16689  0 Jul11 ?        00:04:58 lpd LOG2

# cat /proc/16689/limits
Limit                     Soft Limit           Hard Limit           Units 
Max cpu time              unlimited            unlimited            seconds 
Max file size             unlimited            unlimited            bytes 
Max data size             unlimited            unlimited            bytes 
Max stack size            10485760             unlimited            bytes 
Max core file size        unlimited            unlimited            bytes 
Max resident set          unlimited            unlimited            bytes 
Max processes             16383                16383                processes 
Max open files            1024                 1024                 files 
Max locked memory         32768                32768                bytes 
Max address space         unlimited            unlimited            bytes 
Max file locks            unlimited            unlimited            locks 
Max pending signals       1024                 1024                 signals 
Max msgqueue size         819200               819200               bytes

# cat /proc/16690/limits
Limit                     Soft Limit           Hard Limit           Units 
Max cpu time              unlimited            unlimited            seconds 
Max file size             unlimited            unlimited            bytes 
Max data size             unlimited            unlimited            bytes 
Max stack size            10485760             unlimited            bytes 
Max core file size        unlimited            unlimited            bytes 
Max resident set          unlimited            unlimited            bytes 
Max processes             16383                16383                processes 
Max open files            1024                 1024                 files 
Max locked memory         32768                32768                bytes 
Max address space         unlimited            unlimited            bytes 
Max file locks            unlimited            unlimited            locks 
Max pending signals       1024                 1024                 signals 
Max msgqueue size         819200               819200               bytes


So, it doesn't appear that it's a problem with "ulimit"...
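
Checking every lpd process in one pass, rather than one PID at a time,
could look like the sketch below.  It assumes pgrep is installed and
that /proc/<pid>/limits is readable for the processes in question; the
function names are illustrative.

```shell
# Print the "Max core file size" line from a limits file.
core_limit_line() {
    grep '^Max core file size' "$1"
}

# Show the core limit for every process whose name is exactly "$1".
# Illustrative helper; relies on pgrep from procps.
show_core_limits() {
    for pid in $(pgrep -x "$1"); do
        printf '%s: ' "$pid"
        core_limit_line "/proc/$pid/limits"
    done
}

# Usage: show_core_limits lpd
```

This is handy for catching a short-lived worker whose limits differ
from the master's, should that ever be the case.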

Because the worker processes are non-privileged and the main daemon
process has / as its CWD, it's potentially a permissions problem.  To
get around that, I set kernel.core_pattern so that core files would go
into /tmp:

# sysctl -a | egrep -i 'kernel.core'
kernel.core_pattern = /tmp/core.%p.%e.%s.%t
kernel.core_uses_pid = 1
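
One way to separate a core_pattern or permissions problem from
something specific to lpd is to crash a throwaway process on purpose
and see whether a core file lands in /tmp.  The sketch below does that
with a short-lived shell that sends itself SIGSEGV.  The name
crash_test is made up for illustration, and the approach is only a
sanity check, not a reproduction of what the lpd workers do.

```shell
# Illustrative helper: start a throwaway shell with the core limit
# raised (if permitted), have it send SIGSEGV to itself, and print the
# exit status seen by the caller.  A status of 139 (128 + signal 11)
# means the child was killed by SIGSEGV.
crash_test() {
    ( ulimit -c unlimited 2>/dev/null; sh -c 'kill -SEGV $$' ) 2>/dev/null
    echo $?
}

# Usage: crash_test; ls -l /tmp/core.*
```

Run once as root and once as the lp user, the two results can be
compared to see whether the pattern and the /tmp permissions work for a
process that never changed its user ID.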


After doing that, still no joy.  After some web searching, I was
even desperate enough to try setting the "kernel.suid_dumpable" parameter
mentioned here:

 	http://wiki.zimbra.com/index.php?title=Enabling_Core_Files

even though the "lpd" process is not setuid; it just starts as root.
That too made no difference.
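
For context, prctl(2) on mainline kernels documents that a process's
"dumpable" flag is reset to the suid_dumpable value whenever its
effective user or group ID changes, not only when a set-user-ID binary
is executed, so a daemon that starts as root and switches to lp may
still be affected.  Whether the RHEL 4 kernel behaves the same way is
an assumption here.  The sketch below just reports the current setting,
checking both the kernel.* location and the fs.* location used by
mainline; the function name is illustrative.

```shell
# Illustrative helper: print the suid_dumpable setting from whichever
# /proc/sys location exists, or "unknown" if neither is readable.
suid_dumpable_value() {
    for f in /proc/sys/kernel/suid_dumpable /proc/sys/fs/suid_dumpable; do
        if [ -r "$f" ]; then
            cat "$f"
            return 0
        fi
    done
    echo unknown
    return 1
}

# Usage: suid_dumpable_value
```

If the flag is only evaluated at the moment the process changes IDs,
the daemon would need to be restarted after the sysctl is changed for
the new value to matter; that too is an assumption worth testing.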

On the off chance that the kernel.core_pattern wasn't being honored, I
even went so far as to briefly try changing ownership (to "lp") and
permissions (775) on /, to give the daemon permission to dump core in /.
That also made no difference, so it's been undone.

I've also tried using "systemtap" to install a probe that just watches
for segfaults from processes named "lpd".  That works, but unfortunately
systemtap on RHEL4 cannot do user-level tracing, which is what I need.

Anyone have any ideas on what I've missed?  To be able to debug what's
going on with the worker daemons, I really need to get my hands on some of
the core files.  I'm comfortable with both gdb and strace/ltrace, but if
at all possible I want to avoid attaching to the main daemon and
gathering a huge volume of data while waiting for one of the forked
children to segfault.  Capturing a core file would be a much better way
to start the debugging process.

Thanks,

Tim
-- 
Tim Mooney                                             Tim.Mooney at ndsu.edu
Enterprise Computing & Infrastructure                  701-231-1076 (Voice)
Room 242-J6, IACC Building                             701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164




More information about the redhat-sysadmin-list mailing list