
exec-shield mmap & brk randomization



I kind of suspected that GCL's trouble might relate to brk randomization.
I avoided getting into these details in my first long message because I
wanted to post that write-up about the executability issues for general
reference, and not make it any more complicated than it already was.

> > System-wide, you can disable the exec-shield functionality with:
> > 
> > 	echo 0 > /proc/sys/kernel/exec-shield
> 
> Does this only effect PROT_EXEC settings on memory pages?  

Nope.  This disables the "exec-shield mode" for all new execs (for those
reading kernel sources at home, PF_RELOCEXEC in task_struct.flags).  Using
"setarch i386 foobar" disables the mode for the run of foobar and its
children; otherwise ELF execs have the mode enabled or disabled according
to the presence or absence of PT_GNU_STACK program headers as I've already
described in detail.  That mode enforces nonexecutability as I described
previously.  It also enables some other layout changes.  I'll describe them
after answering your other various questions about how to be sure what's what.

> This at least could function as a work-around for now, if we can make
> configure figure out when it is needed (cat
> /proc/sys/kernel/exec-shield && [ -x setarch ] ?)  If this is the
> wisest solution, let me know and I'll protect the image creations with
> this command.

I would just check for setarch.  You don't really need to check for
/proc/sys/kernel/exec-shield existing, though I suppose it doesn't hurt
since you shouldn't need to use setarch when exec-shield isn't there.

> To my knowledge, we have no nested functions, nor rely on an
> executable C stack.

The failures are pretty obvious when that's the nature of the problem.
i.e., you will get a SIGSEGV with the PC value set to some address,
and you can look in /proc/PID/maps and see the region containing the PC is
not executable, and voila, you're sure that's the problem (or isn't).

> Are these utils in any (unstable) Debian packages?

The `execstack' program is only available as part of the prelink package by
Jakub Jelinek, in very recent versions of that package.  Off hand I don't
know what version of prelink, if any, is in Debian.  readelf and objdump
are part of binutils, and versions too old to know the PT_GNU_STACK magic
number just show you the number instead of the name (match up with <elf.h>
values), so you can still see what's going on.

> So even with nested functions, code should compile and run from
> source, right?  

With current tools on FC1, yes.  You should always be able to tell by
examining the binaries with readelf/objdump/execstack how the tools marked
(or didn't mark) the binary.

> We don't use any asm.

In that case you can be pretty sure that executable stack per se is not the
problem unless you are using GCC nested functions and we have some tools bugs.

> We get all pages via sbrk, and redefine malloc to a call to a native
> memory management system which in turn calls sbrk as needed.  

This is probably where your problem lies.  See below.  


I mentioned layout changes enabled by the exec-shield mode.  The first of
these is randomization of the addresses returned by mmap when not using the
MAP_FIXED flag bit and supplying 0 as the first argument rather than a
specific hint address.  It has always been the case that mmap is specified
to return unpredictable addresses when not given MAP_FIXED, and the
application cannot presume any particular choices will be made (the address
given in the first argument to mmap is a nonbinding suggestion).  In the
past, Linux kernels have always returned a very predictable sequence of
addresses.  In Fedora Core kernels, for processes in exec-shield mode, mmap
returns truly unpredictable addresses.  This affects programs that presume
what addresses their mmap calls will return, and those that presume what
addresses no mmap call will ever return.  Note that this includes the mmap
calls made by the dynamic linker to load shared libraries before any
library or application code gets control, and potentially even the kernel's
mapping of the dynamic linker itself done at exec.  So if you had your eye
on some particular part of the address space not directly mapped by your
executable, it might already be in use by the time you get a chance to look.

Incidentally, the mmap randomization is what broke MIT Scheme.  It presumed
that the low 64MB of the address space would never be used at all, and did
mmap with MAP_FIXED on addresses in that range that would overwrite other
mappings such as those for the shared library containing the mmap function
itself.  That's a case of presuming what addresses "anywhere" mmap calls
would rule out, when no such guarantee was ever part of the specification
of the system interface.  MIT Scheme really wants that particular part of
the address space for its data due to its pointer tagging implementation
(high tags).  The mmaps for shared libraries done before the Scheme runtime
gets control are now randomized and might very well impinge on the
[0,0x4000000) range.  The only proper way to reserve such a range is with a
PT_LOAD program header in the ELF executable, which can request a PROT_NONE
mapping to reserve the range (without consuming any disk or RAM) so that it
has carte blanche to overwrite that range with MAP_FIXED mappings later.
Unfortunately, getting that into your binary is a bit of a pain in the ass
futzing with linker scripts and bits of magic dust.  I posted some quick
examples that demonstrated it adequately for the MIT Scheme maintainer, but
that maintainer is rather more experienced than the average bear.  If you
need to make this happen, I'll be happy to help you figure out the magic.


The second layout change is what I suspect broke GCL.  It broke Emacs's
unexec as well.  Personally, I consider this change incorrect.  However, we
have not yet hashed out among the RH developers concerned with this area
what the resolution will be.  Moreover, I tend to think that anything
broken by it probably ought to be doing things differently in the long run.
Since the dawn of time, the "break area" in Unix has started immediately
after the executable's writable segment (i.e. after its .bss section) and
extended upward from there.  By "the break area", I mean the region of
memory starting at the address returned by sbrk the first time it's called
after an exec.  From the beginning of Unix until two weeks ago Wednesday,
the first `sbrk (0)' returned &end, the end of your .bss; increasing the
break with sbrk calls gave a contiguous region from your data segment
through to the current address of the dynamically-extended break.  In
Fedora Core kernels, for processes in exec-shield mode, this is no longer
the case.  The starting address of the break area is randomized at exec
time, in a fashion similar to the randomization of mmap addresses.  The
first call to `sbrk (0)' will tell you the lower bound of the region, which
will not be lined up with the end of your executable's writable segment.
As always, calls to sbrk to increase the size of the region will work as
long as there is unused address space above the current break region
(randomly placed mmaps will tend to be elsewhere, lower in the address
space, and not impinge on break expansion).  But the exact location of the
region is now somewhat unpredictable, and you can always expect a hole
between static data (your program's writable segment, i.e. .data+.bss) and
the dynamically-extended break region.

As I said, I personally don't like this change.  That's because I consider
starting at &end to have been part of the specification of the break
functionality inherited from ancient Unix, and breaking such things is just
plain wrong.  Nonetheless it is at least for the moment the way things are
in FC1 kernels.  I don't want to engage in a discussion here about the
merits of this change.  I would like to help hash out how precisely it
affects you and any possible ways to work around it there might be.  (If
figuring it all out turns out to lead you not to want to change anything
and instead to complain heartily about the kernel behavior change, then
that's as may be.)

I said that anything broken by it probably ought to be doing things
differently.  The reason I say that is that I consider the brk interface
obsolete.  I can't really see any good reason to be using it in preference
to the other options.  It's inherently limited as an allocation interface
in that it provides just one contiguous region; address space fragmentation
from the mappings for the executable, shared libraries, thread stacks, etc,
will mean that many smaller discontiguous holes are available, and using
only the break region will mean the total limit on useful allocation in the
process can be much lower than the true limit imposed by the configured
resource limits, the system implementation, or available virtual memory
resources.  The other option is to use mmap and be prepared to get
discontiguous regions when requesting pages on separate occasions.  If
contiguity is required, you can use the mremap call when available (AFAIK
only on Linux kernels) to extend a region and move it if a sufficient
contiguous region is not free in the original location.  If contiguity is
only somewhat preferred but not strongly so, you can use a nonzero first
argument to mmap (without using the MAP_FIXED flag), and you will
ordinarily get the requested address or the next following address with a
free region of the requested size.  (But note that there is still no
guarantee of getting the requested region without MAP_FIXED and the robust
program must be prepared to handle discontiguous regions.  FC1 kernels will
in fact hand back predictable addresses, but future kernels might not
always do so.)

If you specifically rely on treating static data and the dynamic break
region as a single contiguous region, something has to change.  If you
don't rely on that, but just on knowing the start and end of the contiguous
break region itself, then you can make the simple and portable change of
using sbrk (0) instead of &end to initialize your idea of the lower bound
of the break region.  In the long run, I would still recommend more drastic
changes to avoid relying on a contiguous break region at all, for the
reasons I gave in the previous paragraph.  If you make those changes, you
will not impose so low a limit on the total memory you can allocate in a
process, a benefit you'll get in the same way on all modern Unix-like
systems (at least the 32-bit ones, I guess 64-bit systems always have more
unused address space after the break than you could possibly need).

Please let me know if there is anything I can clarify or anything I can do to
help you figure out what changes to your program might be best.


Thanks,
Roland



