capturing a core file from a non-privileged daemon
Tim Mooney
Tim.Mooney at ndsu.edu
Thu Jul 15 20:42:27 UTC 2010
All-
I have a system where a daemon (started as root) periodically forks
non-privileged workers, and those workers sometimes try to dump core.
I would like to capture those core files for debugging, but so far I
have been unable to find a way for the daemon to actually generate the
core file. I'm hoping someone can point out what I'm missing.
The system is RHEL 4.8, x86_64, currently running 2.6.9-89.0.23.ELsmp.
Running "dmesg", I see dozens of these per day:
# dmesg
lpd[12642]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[21006]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[16944]: segfault at 0000000036383675 rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[19501]: segfault at 0000000036383675 rip 0000000000a73cce rsp 00000000ffffcea0 error 4
lpd[11300]: segfault at 000000000000000c rip 0000000000a73cce rsp 00000000ffffcea0 error 4
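For anyone who wants to see how these faults cluster, here is a minimal sketch that tallies the kernel messages by program name, faulting address, and instruction pointer. The function name summarize_segfaults is made up for illustration, and it assumes the 2.6.9-era "segfault at ... rip ..." message format shown above; other kernels word this message differently.

```shell
# Hypothetical helper: tally "segfault at" kernel messages read from stdin.
# Assumes each record starts with "name[pid]:" and carries "at ADDR" and
# "rip ADDR" fields, as in the dmesg output above.
summarize_segfaults() {
    awk '/segfault at/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "at")  addr = $(i + 1)
            if ($i == "rip") rip  = $(i + 1)
        }
        name = $1
        sub(/\[[0-9]+\]:$/, "", name)      # strip the "[pid]:" suffix
        count[name " addr=" addr " rip=" rip]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}
```

It would be used as "dmesg | summarize_segfaults". For the five messages above, every fault shares the same rip, which suggests a single faulting instruction reached with two different bad addresses.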
The daemon is a slightly older version of the LPRng lpd. Its model is to
start as root but switch to a non-privileged user (lp) and then fork workers
as needed for queue processing. The daemon is locally-compiled and is
not stripped.
It's being started with the following line in /etc/init.d/lpd:
daemon /usr/local/sbin/lpd
Because the daemon shell function defaults to setting "ulimit -c 0", I've
added the following two lines to the startup script, to override that
default behavior:
DAEMON_COREFILE_LIMIT=unlimited
export DAEMON_COREFILE_LIMIT
If I check the /proc/<pid>/limits file for either the master lpd process
or any of the worker processes, I can see that the core file limit is
"unlimited":
$ ps -ef | grep -i lpd
lp       16689     1  1 Jul11 ?        01:17:39 lpd Waiting
lp       16690 16689  0 Jul11 ?        00:04:58 lpd LOG2
# cat /proc/16689/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             16383                16383                processes
Max open files            1024                 1024                 files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1024                 1024                 signals
Max msgqueue size         819200               819200               bytes
# cat /proc/16690/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             16383                16383                processes
Max open files            1024                 1024                 files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1024                 1024                 signals
Max msgqueue size         819200               819200               bytes
So, it doesn't appear that it's a problem with "ulimit"...
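Since the workers are short-lived, the same check can be scripted rather than done by hand. The following is only a sketch: the function names core_limit_of and report_core_limits are invented here, and it assumes a kernel that exposes /proc/<pid>/limits in the layout shown above and a system with pgrep installed.

```shell
# Hypothetical helper: print the soft and hard core-size limits found in a
# limits file laid out like /proc/<pid>/limits above.
core_limit_of() {
    awk '/^Max core file size/ { print $5, $6 }' "$1"
}

# Hypothetical helper: print "pid soft hard" for every running process whose
# name exactly matches the first argument (for example: lpd).
report_core_limits() {
    for pid in $(pgrep -x "$1"); do
        printf '%s %s\n' "$pid" "$(core_limit_of "/proc/$pid/limits")"
    done
}
```

Running "report_core_limits lpd" from cron or a loop would catch a worker whose limit differs from the master's, if that ever happens.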
Because the worker processes are non-privileged and the main daemon
process has / as its CWD, it's potentially a permissions problem. To
get around that, I set kernel.core_pattern so that core files would go
into /tmp:
# sysctl -a | egrep -i 'kernel.core'
kernel.core_pattern = /tmp/core.%p.%e.%s.%t
kernel.core_uses_pid = 1
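To make the expected file name concrete: in core_pattern, %p is the PID of the dumping process, %e the executable name, %s the signal number, and %t the dump time in seconds since the epoch. The sketch below previews the name a given pattern would produce. The function name expand_core_pattern is made up, it handles only these four specifiers, and the PID and timestamp in the usage example are illustrative values, not taken from a real dump.

```shell
# Hypothetical helper: expand the %p, %e, %s and %t specifiers of a
# core_pattern string. Arguments: pattern, pid, executable name, signal
# number, epoch time. Values containing "|" or "&" are not handled.
expand_core_pattern() {
    printf '%s\n' "$1" | sed \
        -e "s|%p|$2|g" \
        -e "s|%e|$3|g" \
        -e "s|%s|$4|g" \
        -e "s|%t|$5|g"
}
```

For example, "expand_core_pattern /tmp/core.%p.%e.%s.%t 12642 lpd 11 1279226547" prints /tmp/core.12642.lpd.11.1279226547 (signal 11 being SIGSEGV), so a successful dump should show up as "ls /tmp/core.*.lpd.*".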
After doing that, still no joy. After some web searching, I was
even desperate enough to try setting the "kernel.suid_dumpable" parameter
mentioned here:
http://wiki.zimbra.com/index.php?title=Enabling_Core_Files
even though the "lpd" process is not setuid; it just starts as root.
That too made no difference.
On the off chance that the kernel.core_pattern wasn't being honored, I
even went so far as to briefly try changing ownership (to "lp") and
permissions (775) on /, to give the daemon permission to dump core in /.
That also made no difference, so it's been undone.
I've also tried using "systemtap" to install a segfault probe that
just watches for segfaults from processes named "lpd". That works, but
unfortunately systemtap on RHEL4 cannot do user-level tracing, which
is what I need.
Does anyone have any ideas on what I've missed? To be able to debug what's
going on with the worker processes, I really need to get my hands on some of
the core files. I'm comfortable with both gdb and strace/ltrace, but if
at all possible I want to avoid attaching one of those tools to the main
daemon and gathering a huge volume of data while waiting for one of the
forked children to segfault. Capturing a core file would be a much better
way to start the debugging process.
Thanks,
Tim
--
Tim Mooney Tim.Mooney at ndsu.edu
Enterprise Computing & Infrastructure 701-231-1076 (Voice)
Room 242-J6, IACC Building 701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164
More information about the redhat-sysadmin-list
mailing list