[PATCH] audit: optionally print warning after waiting to enqueue record

Thu Jun 18 22:57:21 UTC 2020

On Thu, Jun 18, 2020 at 09:39:08AM -0400, Steve Grubb wrote:
> On Wednesday, June 17, 2020 6:54:16 PM EDT Max Englander wrote:
> > On Wed, Jun 17, 2020 at 02:47:19PM -0400, Paul Moore wrote:
> > > On Tue, Jun 16, 2020 at 12:58 AM Max Englander <max.englander at gmail.com> wrote:
> > > > In environments where security is prioritized, users may set
> > > > --backlog_wait_time to a high value in order to reduce the likelihood
> > > > that any audit event is lost, even though doing so may result in
> > > > unpredictable performance if the kernel schedules a timeout when the
> > > > backlog limit is exceeded. For these users, the next best thing to
> > > > predictable performance is the ability to quickly detect and react to
> > > > degraded performance. This patch proposes to aid the detection of
> > > > kernel audit subsystem pauses through the following changes:
> > > > 
> > > > Add a variable named audit_backlog_warn_time. Enforce the value of this
> > > > variable to be no less than zero, and no more than the value of
> > > > audit_backlog_wait_time.
> > > > 
> > > > If audit_backlog_warn_time is greater than zero and if the total time
> > > > spent waiting to enqueue an audit record is greater than or equal to
> > > > audit_backlog_warn_time, then print a warning with the total time
> > > > spent waiting.
> > > > 
> > > > An example configuration:
> > > >         auditctl --backlog_warn_time 50
> > > > 
> > > > An example warning message:
> > > >         audit: sleep_time=52 >= audit_backlog_warn_time=50
> > > > 
> > > > Tested on Ubuntu 18.04.04 using complementary changes to the audit
> > > > userspace: https://github.com/linux-audit/audit-userspace/pull/131.
> > > > 
> > > > Signed-off-by: Max Englander <max.englander at gmail.com>
> > > > ---
> > > > 
> > > >  include/uapi/linux/audit.h |  7 ++++++-
> > > >  kernel/audit.c             | 35 +++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 41 insertions(+), 1 deletion(-)
> > > 
> > > If an admin is prioritizing security, aka don't lose any audit
> > > records, and there is a concern over variable system latency due to an
> > > audit queue backlog, why not simply disable the backlog limit?
> > 
> > That’s good in some cases, but in other cases unbounded growth of the
> > backlog could result in memory issues. If the kernel runs out of memory
> > it would drop the audit event or possibly have other problems. It could
> > also consume memory in a way that starves user workloads or causes
> > them to be killed by the OOMKiller.
> 
> The kernel cannot grow the backlog unbounded. If you do nothing, the backlog 
> is 64 - which is too small to really use. Otherwise, you set the backlog to a 
> finite number with the -b option.
> 
> > To refine my motivating use case a bit, if a Kubernetes admin wants to
> > prioritize security, and also avoid unbounded growth of the audit
> > backlog, they may set -b and --backlog_wait_time in a way that limits
> > kernel memory usage and reduces the likelihood that any audit event is
> > lost. Occasional performance degradation may be acceptable to the admin,
> > but they would like a way to be alerted to prolonged kernel pauses, so
> > that they can investigate and take corrective action (increase backlog,
> > increase server capacity, move some workloads to other servers, etc.).
> > 
> > To state it another way: the kernel can currently be configured to print a
> > message when the backlog limit is exceeded and it must discard the audit
> > event. This is a useful message for admins, which they can address with
> > corrective action. I think a message similar to the one proposed by this
> > patch would be equally useful when the backlog limit is exceeded and the
> > kernel is configured to wait for the backlog to drain. Admins could
> > address that message in the same way, but without the cost of lost audit
> > events.
> 
> If backlog wait time is exceeded, that could be a useful warning if that does 
> not exist. I don't know how often that could happen...and of course without a 
> warning we don't know if it happens or why it happens.

What you’re describing already exists, if I’m reading your words right.
In the event that the backlog wait time limit is exceeded, the -f flag
is consulted, and, if the value of -f is 1, then an error message
stating that the backlog limit is exceeded is printed. This is also true
when the backlog wait time is zero.

What I am suggesting is that even if the backlog wait time is not
exceeded, it would be useful for the kernel to report when backlog
waiting occurs as a way to help identify degraded kernel performance.
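
To make the proposal concrete, here is a rough userspace sketch of the
check described in the patch summary. It is not the patch itself. The
function names, the starting wait time of 60, and the choice to reject an
out-of-range value with -EINVAL (rather than clamp it) are assumptions
made for illustration; only the variable names and the message format
come from the description quoted above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel variables. The units are whatever
 * the audit subsystem uses for wait times; this sketch does not assume any. */
long audit_backlog_wait_time = 60;  /* assumed value, illustration only */
long audit_backlog_warn_time;       /* 0 means the warning is disabled */

/* Enforce 0 <= warn_time <= wait_time. Rejecting with -EINVAL is an
 * assumption; the description only says the range is enforced. */
int set_backlog_warn_time(long t)
{
	if (t < 0 || t > audit_backlog_wait_time)
		return -EINVAL;
	audit_backlog_warn_time = t;
	return 0;
}

/* Called with the total time spent waiting to enqueue one record.
 * Returns true if a warning was printed. */
bool warn_if_slow(long sleep_time)
{
	if (audit_backlog_warn_time > 0 &&
	    sleep_time >= audit_backlog_warn_time) {
		printf("audit: sleep_time=%ld >= audit_backlog_warn_time=%ld\n",
		       sleep_time, audit_backlog_warn_time);
		return true;
	}
	return false;
}
```

With the warn time set to 50, a wait of 52 produces the example message
from the patch description, while a wait of 49 prints nothing. With the
warn time left at zero, no warning is printed regardless of how long the
wait was.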

> I also wished we had metrics on the backlog such as max used. That might help 
> admins tune the size of the backlog.
> 
> -Steve