netlink ACK handling in audit_set_pid()

Tue Oct 17 14:21:32 UTC 2023

We are experiencing strange failures where the audit daemon fails to
start on boot, hitting an ENOBUFS error on its audit_set_pid() call.
This can be reproduced by repeatedly restarting the audit daemon while
the system is under heavy audit load. This also seems to be dependent
on the number of CPUs - we can reproduce this with 2 CPUs but not with
48.

Tracing showed a race between the kernel enabling audit messages to be
sent to the daemon and actually sending the ACK, wherein the socket
buffer could get filled by audit messages before the ACK could be sent,
leading to the ACK being dropped and ENOBUFS set on the socket by
netlink code. A patch to mitigate this race from the kernel side is
separately under discussion on the audit subsystem mailing list:
https://lore.kernel.org/audit/20230922152749.244197-1-chris.riches@nutanix.com/

It's worth noting that this is almost certainly the same issue observed
in this thread from last month (participants CCed):
https://listman.redhat.com/archives/linux-audit/2023-September/020087.html

Here, I am hoping to discuss ACK handling from the userspace side. The
current implementation is a little odd - check_ack() will happily
return success without seeing an ACK if a non-ACK message is at the top
of the socket queue, but will fail if no message arrives within the
timeout. It also of course fails if ENOBUFS is set on the socket, but
this failure only seems to matter when doing audit_set_pid() - similar
failures during main-loop message processing are logged but otherwise
ignored, as far as I can tell.
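
To make the ACK/non-ACK distinction concrete, here is a rough sketch of
how a received message could be classified explicitly instead of being
treated as success by default. This is not the libaudit code:
classify_reply() and enum reply_kind are names I made up for
illustration. The only real API relied on is the message layout from
<linux/netlink.h>, where an ACK is an NLMSG_ERROR message whose error
field is 0.

```c
#include <linux/netlink.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum reply_kind { REPLY_ACK, REPLY_NACK, REPLY_OTHER };

/*
 * Classify one received netlink message against the sequence number of
 * the request we are waiting on. For REPLY_ACK and REPLY_NACK, *err_out
 * (if non-NULL) receives the error field: 0 or a negative errno.
 */
enum reply_kind classify_reply(const void *buf, size_t len,
                               uint32_t seq, int *err_out)
{
    struct nlmsghdr nlh;
    int error;

    if (len < NLMSG_LENGTH(sizeof(error)))
        return REPLY_OTHER;          /* too short to be an ACK/NACK */
    memcpy(&nlh, buf, sizeof(nlh));  /* avoid unaligned access */
    if (nlh.nlmsg_type != NLMSG_ERROR || nlh.nlmsg_seq != seq)
        return REPLY_OTHER;          /* e.g. an audit record: keep waiting */

    /* The payload of NLMSG_ERROR starts with the int error field. */
    memcpy(&error, (const char *)buf + NLMSG_HDRLEN, sizeof(error));
    if (err_out)
        *err_out = error;
    return error == 0 ? REPLY_ACK : REPLY_NACK;
}
```

A caller would loop on this until it sees REPLY_ACK or REPLY_NACK, or
until the timeout expires. What to do with the REPLY_OTHER messages it
reads along the way is a separate question.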

I'm not sure I quite understand the intentions of the code, but it
seems odd to let ENOBUFS be a fatal error here. It most likely means
the socket buffer got flooded with audit messages, which the kernel
only sends to us once audit_set_pid() has taken effect, so the call
actually succeeded. Perhaps we should just ignore ENOBUFS or even set
NETLINK_NO_ENOBUFS?
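
For reference, the two options might look roughly like this. Both
helper names (set_no_enobufs() and recv_error_is_fatal()) are
hypothetical and nothing like them exists in libaudit today, as far as
I know. The socket option itself is the one documented in netlink(7).
The SOL_NETLINK fallback define is only there in case the libc headers
do not provide it.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270   /* kernel value, for libcs that lack it */
#endif

/* Option A: ask the kernel not to flag receive overruns with ENOBUFS. */
int set_no_enobufs(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS,
                      &on, sizeof(on));
}

/*
 * Option B (hypothetical policy): leave the socket alone, but do not
 * treat ENOBUFS as fatal while waiting for the audit_set_pid() ACK.
 */
int recv_error_is_fatal(int err)
{
    switch (err) {
    case ENOBUFS:   /* receive queue overran; the ACK may be lost */
    case EINTR:     /* interrupted; the caller can simply retry */
        return 0;
    default:
        return 1;
    }
}
```

Note that option A does not stop messages from being dropped; it only
stops the overrun from being reported on the socket. Since it is a
per-socket setting, it would also hide overruns from the main loop,
which may or may not be desirable.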

It may also be worth increasing the netlink socket buffer size, though
this could only make the issue less likely and would not be sufficient
under arbitrarily heavy audit loads.
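
A minimal sketch of what I mean, assuming the daemon has CAP_NET_ADMIN
(needed for SO_RCVBUFFORCE) and falling back to plain SO_RCVBUF
otherwise. grow_rcvbuf() is a made-up helper name, and any size passed
to it would be a guess rather than a tuned value.

```c
#include <sys/socket.h>

/*
 * Request a receive buffer of `bytes` and return the size the kernel
 * reports afterwards via SO_RCVBUF, or -1 on error.
 */
int grow_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

#ifdef SO_RCVBUFFORCE
    /* Needs CAP_NET_ADMIN; not limited by net.core.rmem_max. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
                   &bytes, sizeof(bytes)) != 0)
#endif
    {
        /* Unprivileged fallback: silently clamped to rmem_max. */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &bytes, sizeof(bytes)) != 0)
            return -1;
    }
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        return -1;
    return granted;
}
```

The value read back is what the kernel actually uses, which Linux sets
to double the requested size to allow for bookkeeping overhead (see
socket(7)), so it should be logged rather than compared directly
against the request.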

Finally, there is another oddity in audit_set_pid() that is tangential
to this discussion but worth highlighting: if the wmode parameter is
WAIT_YES, then there is some additional ACK-handling which waits for
100 milliseconds and eats the top message of the socket queue if one
arrives, without inspecting it. This seems completely wrong as the ACK
will have already been consumed by check_ack() if there was one, and so
the best this code can do is nothing, and at worst (quite likely) it
will swallow a genuine audit message without ever recording it.
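
Simply removing that wait is probably the right fix. If some form of it
had to stay, though, it could at least inspect the head of the queue
before consuming it. Here is a rough sketch using MSG_PEEK. The name
consume_ack_only() is hypothetical, and this is not how libaudit is
currently written.

```c
#include <linux/netlink.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Wait up to timeout_ms for a message and consume it only if it is an
 * NLMSG_ERROR (ACK/NACK). Returns 1 if one was consumed, 0 if nothing
 * arrived or the head of the queue is something else (left in place),
 * -1 on error.
 */
int consume_ack_only(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct nlmsghdr nlh;
    ssize_t n;
    int rc;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;                   /* 0: timeout, -1: error */

    /* Look at the header without removing the message from the queue. */
    n = recv(fd, &nlh, sizeof(nlh), MSG_PEEK | MSG_DONTWAIT);
    if (n < 0)
        return -1;
    if ((size_t)n < sizeof(nlh) || nlh.nlmsg_type != NLMSG_ERROR)
        return 0;                    /* audit record: do not eat it */

    /* Head of the queue is an ACK/NACK, so dropping it loses nothing. */
    n = recv(fd, &nlh, sizeof(nlh), MSG_DONTWAIT);
    return n < 0 ? -1 : 1;
}
```

This only looks at the first queued message, so it does nothing for an
ACK sitting behind audit records. It just avoids the worst case of
silently discarding a record.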

- Chris