Problem with auditd/SnareLinux on RHEL 5.3 - auditd glomming memory

Thu Feb 5 02:14:03 UTC 2009

2009/2/4 Smith, Gary R <gary.smith at pnl.gov>:
> Hello All,

Hi, Gary

> I have some systems that have just been updated from RHEL 5.2 to RHEL
> 5.3. The version of auditd is 1.7.7 and SnareLinux is 1.5.0.
>
> Some time after the update ran, I noticed that the amount of free memory on
> the systems had dramatically gone down. Running top, I saw that auditd had
> sucked up lots of memory:

[...]

One quick question: are you having lots of events getting logged in
/var/log/audit/audit.log when memory increases?

I noticed very similar behavior when the system was under heavy
stress (i.e., many rules and many remote clients generating audit
events). After much debugging, we found that the asynchronous
nature of netlink made it possible for auditd's queue to grow without
bound, until the kernel started killing other processes due to OOM
(auditd asks the kernel not to kill it under OOM conditions, so every
process but auditd is shot).

The reason was that audit's consumer thread -- the one that runs
auditd-event.c:event_thread_main() -- was consuming events more slowly
than the rate at which netlink events were sent from the kernel to
auditd's main thread.

The solution we found (and which is still being tested) was to define
a high water mark on how many events auditd is allowed to hold in its
input queue. Given that each netlink message takes about 9 KB, one can
set the high water mark to e.g. 500000 to have at most 4.5 GB of events
in RAM. When auditd reaches that high water mark, we ask the kernel
to slow down: all further events sent by the kernel carry a "need an
ack" flag, so that the calling process (the one that made the system
call being audited) blocks until the daemon sends a reply.
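
To make the arithmetic above concrete, here is a small sketch. The
9 KB figure is the rough per-message size quoted above (taken here as
9000 bytes), and the function and macro names are made up for
illustration; none of this is code from auditd.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: about 9 KB per queued netlink message, as estimated above. */
#define APPROX_EVENT_BYTES UINT64_C(9000)

/* Worst-case bytes held in the input queue for a given high water mark. */
static uint64_t queue_bytes_at_mark(uint64_t high_water_mark)
{
    /* Guard against overflow of the multiplication below. */
    assert(high_water_mark <= UINT64_MAX / APPROX_EVENT_BYTES);
    return high_water_mark * APPROX_EVENT_BYTES;
}

/* Largest high water mark that fits in a given memory budget. */
static uint64_t mark_for_budget(uint64_t budget_bytes)
{
    return budget_bytes / APPROX_EVENT_BYTES;
}
```

With these numbers, queue_bytes_at_mark(500000) comes to 4500000000
bytes, i.e. the 4.5 GB mentioned above; going the other way,
mark_for_budget() lets you pick the mark from the RAM you can spare.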

Also, while auditd is in this "OOM mode", for each netlink message
received, the consumer thread digests "N" messages from the input
queue instead of a single one, and just after that it tells the main
thread to acknowledge back to the kernel.

Finally, when the queue drops to a low water mark, auditd tells the
kernel to return to normal mode, and messages are sent
asynchronously again.
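
The high/low water mark switching described above can be sketched as
a small state machine. This is only an illustration of the idea, not
our actual patch: struct flow_ctl and both functions are hypothetical
names, and the real code also has to send the mode change to the
kernel, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow-control state; not part of auditd. */
struct flow_ctl {
    uint64_t low_mark;   /* leave throttled mode at or below this depth */
    uint64_t high_mark;  /* enter throttled mode at or above this depth */
    bool throttled;      /* true while the kernel is asked to wait for acks */
};

static void flow_ctl_init(struct flow_ctl *fc, uint64_t low, uint64_t high)
{
    assert(low < high);  /* the gap between the marks prevents flapping */
    fc->low_mark = low;
    fc->high_mark = high;
    fc->throttled = false;
}

/*
 * Update the mode for the current queue depth. Returns true when the
 * mode changed, i.e. when the caller would need to notify the kernel.
 */
static bool flow_ctl_update(struct flow_ctl *fc, uint64_t depth)
{
    if (!fc->throttled && depth >= fc->high_mark) {
        fc->throttled = true;
        return true;
    }
    if (fc->throttled && depth <= fc->low_mark) {
        fc->throttled = false;
        return true;
    }
    return false;
}
```

Using two marks instead of one means the daemon does not bounce
between modes when the queue depth hovers around a single threshold.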

It might be the case that you're falling into the very same situation
(events are being put in the queue at a faster rate than the
consumer can read them). You can check that by making the
following changes to auditd-event.c:

- define a global, static uint64_t input_queue_size;
- increment it in enqueue_event() after the lock on queue_lock is taken;
- decrement it in event_thread_main() before queue_lock is unlocked;
- back in enqueue_event(), add a trace to print input_queue_size
  whenever it grows over e.g. 100. You can put that before queue_lock
  is unlocked.
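
As a standalone sketch of the steps above (not a patch against
auditd-event.c): note_enqueue() and note_dequeue() are stand-ins I made
up for the spots inside enqueue_event() and event_thread_main() where
the counter would be touched, and the queue_lock here is a local mutex
rather than auditd's own.

```c
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_TRACE_THRESHOLD 100  /* example threshold from the list above */

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t input_queue_size;  /* only touched while queue_lock is held */

/* Stand-in for the counter update inside enqueue_event(). */
static uint64_t note_enqueue(void)
{
    uint64_t size;

    pthread_mutex_lock(&queue_lock);
    /* ... the real code links the event into the queue here ... */
    size = ++input_queue_size;
    if (size > QUEUE_TRACE_THRESHOLD)
        fprintf(stderr, "input_queue_size=%" PRIu64 "\n", size);
    pthread_mutex_unlock(&queue_lock);
    return size;
}

/* Stand-in for the counter update inside event_thread_main(). */
static uint64_t note_dequeue(void)
{
    uint64_t size;

    pthread_mutex_lock(&queue_lock);
    /* ... the real code unlinks the next event here ... */
    assert(input_queue_size > 0);  /* a dequeue must match an earlier enqueue */
    size = --input_queue_size;
    pthread_mutex_unlock(&queue_lock);
    return size;
}
```

If the trace shows the counter climbing steadily while audit.log is
being written heavily, the consumer is falling behind the producer.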

Please let me know if that happens to be the reason for the problems
you're having. I've been working mostly with audit 1.7.4 and kernel
2.6.16.16+patches, so our changes still need to be ported to a recent
kernel and audit package before they're submitted officially (that's
likely to happen in March, after my master's thesis final deadline --
which is driving me crazy).

Cheers,
Lucas