Handling -ENOBUFS

Wed Nov 5 20:56:30 UTC 2008

On Wed, Nov 5, 2008 at 4:19 PM, Steve Grubb <sgrubb at redhat.com> wrote:
> On Wednesday 05 November 2008 11:30:16 Lucas C. Villa Real wrote:
>> I'm facing a situation where -ENOBUFS is returned from both
>> audit_send() and audit_get_reply(). The system is under high stress,
>> with 250k files being created and having creat() and chmod() syscalls
>> audited.
>
> Is this what you really wanted to audit? :)

Yes, not a single event can be missed in the system I'm working on,
unfortunately :)

>> Looking at the code in lib/netlink.c, I saw that audit_send() doesn't
>> handle -ENOBUFS. Would it be possible to replace the condition from
>> "while (retval < 0 && errno == EINTR)" to "while (retval < 0 && (errno
>> == EINTR || errno == ENOBUFS))" to fix the problem when sending
>> packets from userspace to kernel?
>
> Have you tried that? Does it fix the problem or just hang the utility?

So far it hasn't hung. However, just in case, I added a maximum number
of retries (currently set to 64). I'm about to launch a new batch to
stress the system once again, and then I'll be able to see if it works
as expected.
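
For reference, a bounded retry along those lines might look roughly like
the sketch below. It is not the actual change to lib/netlink.c: the
helper name send_with_retry() and the MAX_SEND_RETRIES macro are made up
for illustration, and it wraps a plain sendto() rather than the real
audit_send() internals.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical cap, matching the value mentioned above. */
#define MAX_SEND_RETRIES 64

/*
 * Hypothetical helper (not part of libaudit): retry sendto() while it
 * fails with EINTR or ENOBUFS, giving up after max_retries extra attempts.
 * Returns the sendto() result; on failure errno is left as set by sendto().
 */
static ssize_t send_with_retry(int fd, const void *buf, size_t len,
                               const struct sockaddr *addr, socklen_t addrlen,
                               int max_retries)
{
    ssize_t retval;
    int attempts = 0;

    assert(max_retries >= 0);
    do {
        retval = sendto(fd, buf, len, 0, addr, addrlen);
    } while (retval < 0 && (errno == EINTR || errno == ENOBUFS) &&
             attempts++ < max_retries);

    return retval;
}
```

Retrying immediately spins while the buffers are still full, so a short
pause between attempts may be worth trying if the cap is hit often.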

>> My understanding for the problem in audit_get_reply() is that the I/O
>> buffers are all full and auditd was just not scheduled at the expected
>> rate, causing these buffers to overflow. Does that make sense?
>
> If you go over the backlog limit, you get a syslog message about that unless
> you have it set to ignore. My guess would be that you have a general network
> memory pool depletion that is not related to audit specifically.

Yes. I hope that increasing auditd's priority will help to drain that.
I'll let you know if that works.

>> If it does, do you have a suggestion about the best way to approach this
>> problem, besides changing auditd's priority?
>
> Increase the backlog and increase auditd's priority. I have not played with
> running auditd with a different scheduler policy than whatever the default
> is. But you may want to see if one of the other scheduler policies treats audit
> better. Or maybe you want to tune /proc/sys/kernel/sched_granularity_ns.
>
>
>> One interesting thing which I noticed is that 'auditctl -s' doesn't
>> report that messages were lost,
>
> They weren't lost by the audit system so it doesn't know they didn't arrive.

Do you think it would make sense to add an extra member to struct
sk_buff (a pointer to a callback function) and then have
skb_queue_tail() signal if it failed to send a message? That would
allow audit to keep track of such losses, as well as any other
subsystem using netlink for communicating with userspace.

>> This is happening with an old kernel, 2.6.16.46 + a bunch of patches,
>> and audit 1.7.4. I cannot completely upgrade it to a new release, but
>> I can certainly backport audit specific bits if you remember having
>> fixed something similar since then.
>
> Well, that proc tunable is only available for the CFS scheduler. Not sure what
> you have for older kernels.

It's not available on this kernel, but I'll keep looking for other ways
to improve the responsiveness of auditd here.

Thanks!
Lucas