[RFC PATCH] audit-testsuite: improve our chances of losing records in lost_reset

Fri Dec 14 16:12:32 UTC 2018

On 2018-12-14 10:53, Paul Moore wrote:
> On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb at redhat.com> wrote:
> > On 2018-12-13 18:23, Paul Moore wrote:
> > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul at paul-moore.com> wrote:
> > > > If the point of the lost_reset test is to flood the system with audit
> > > > records, why are we restricting ourselves with a filter?  Let's log
> > > > everything.
> > > >
> > > >  tests/lost_reset/test |    4 ++--
> > >
> > > This resolves a problem I've been seeing with a fairly stubborn
> > > system.  I would be very curious to hear if any others were
> > > experiencing problems with the lost_reset that this patch resolves.
> >
> > This saddistic test would have brought the system to its knees before
> > the queue rework.  This still works fine on my system, so this is progress.
> 
> Yes, I'll admit it is a bit extreme.  The good news is that the system
> survives multiple runs of this without falling over (and remains
> responsive), so that's a win :)
> 
> > I suspect that something weird was going on with your test
> > such that it wasn't getting the right PID.  I don't know how this is
> > possible, but I have seen this same thing recently on the FUSE umount2
> > test where I needed to bump up the PID by one to catch the right task.
> 
> The wrong PID idea struck me as a bit curious, and since I have one
> system where I can reproduce this reliably I thought I would play
> around a bit more and see what I could find.  The results sent me down
> a bit of a rathole but the results are interesting ...
> 
> On my test system the PID of the flood ping was being recorded
> correctly, every time I ran it with some instrumentation added - and I
> ran it a quite a few times.  While you may have seen this happen in
> practice, but that doesn't seem to be the case on this particular
> system.
> 
> As an aside, have you spent any time trying to debug that wrong PID
> problem?  While not strictly audit related, that seems like a pretty
> serious Bash bug.  Or maybe it's a problem with the test?  I vaguely
> remember a discussion between you and Ondrej and some confusion around
> which Bash variable to use to fetch PIDs, but I may be mistaken.

I haven't spent much time trying to debug that bash PID increment issue,
but it perplexed me since it was the identical technique used in
multiple tests in the audit-testsuite that has never caused an issue
previously on any of the same set of test machines.  This was for the
missing mount umount2 hang bug test
https://github.com/linux-audit/audit-testsuite/pull/76

> Moving on from the PID idea I decided to dig a bit deeper into when
> the kernel registers a lost record, and I think I found the core
> issue.  The kernel logs lost records, and bumps the counter, via the
> audit_log_lost() function.  The audit_log_lost() function is called
> from a number of places (most uninteresting for this discussion: error
> handlers and the like), but I think the callers we are most interested
> in are the following:
> 
>  * audit_log_start()
>  * kauditd_hold_skb()
> 
> Let's start with audit_log_start().  It calls audit_log_lost() in two
> places: one place is allocation error handling code (not relevant) and
> the other is if we are bumping into the backlog limit (relevant, and
> interesting).  In the latter case audit_log_lost() is called and
> audit_log_start() returns NULL, effectively dropping the record on the
> floor, lost forever.
> 
> Moving on to kauditd_hold_skb(), things are a bit more complicated.
> As a reminder, kauditd_hold_skb() is the "point of last resort" for
> audit records; if we are here it means there is no auditd connected.
> The first thing kauditd_hold_skb() does it write the record to the
> console since this may be the only way the record ever sees userspace.
> Beyond that if "audit=1" was not specified on the kernel command line
> we drop the record, if "audit=1" was specified on the command line we
> try to queue the record but if the queue is full we call
> audit_log_lost().  Ultimately I believe it was this logic, and a race
> between the kauditd_thread() and the record generation, which caused
> my system to never hit the backlog limit in audit_log_start() which
> would have caused it to call audit_log_lost().  On my problematic
> system, booting with "audit=1" caused the existing lost_reset test to
> pass without problem.
> 
> Make sense?
> 
> This brings up to the next step: how do we want to address this?
> 
> Prior to the queue rework that started in v4.10 things were a bit
> simpler and it looks like we always registered a lost record
> independent of the "audit=?" setting on the kernel command line and
> the current queue backlog.  While this would have made this test
> easier, it could result in some over counting problems in the cases
> where an auditd instance came along and read the "lost" records from
> the queue.  I don't think reverting to this behavior is ideal.
> 
> I'm also not certain that recording lost records in the *not*
> "audit=1" case is a good solution either.  In the case where the
> system is not running an audit daemon they are almost guaranteed to
> hit the backlog at some point and then their system will start spewing
> scary looking audit log lost messages; we would surely win lots of
> friends this way.
> 
> We could move the "audit=1" check (really it's an audit_default check)
> into audit_log_lost() and use it to squelch the printk() and then call
> audit_log_lost in kauditd_hold_skb() if we can't queue the record.
> This should preserve an accurate lost record count (well, until it
> rolls over, but that's not a new concern), prevent unnecessary scary
> lost record messages, and ensure a consistent audit_log_lost()
> operation (all the other callers I didn't mention in this mail).
> 
> Or the simplest option is to just ignore this and require that the
> audit-testsuite be run on a system booted with "audit=1" :)
> 
> I'm currently leaning towards moving the "audit=1" check into
> audit_log_lost(), what do you guys think?

I'll need some time to digest all this.  My first reaction is that
requiring "audit=1" for the audit-testsuite is not the right answer and
as we've discussed previously, it makes sense to run the entire
testsuite both with and without "audit=1".

> paul moore

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635