[RFC PATCH] audit-testsuite: improve our chances of losing records in lost_reset

Fri Dec 14 20:35:42 UTC 2018

On Fri, Dec 14, 2018 at 11:12 AM Richard Guy Briggs <rgb at redhat.com> wrote:
> On 2018-12-14 10:53, Paul Moore wrote:
> > On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb at redhat.com> wrote:
> > > On 2018-12-13 18:23, Paul Moore wrote:
> > > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul at paul-moore.com> wrote:

...

> > As an aside, have you spent any time trying to debug that wrong PID
> > problem?  While not strictly audit related, that seems like a pretty
> > serious Bash bug.  Or maybe it's a problem with the test?  I vaguely
> > remember a discussion between you and Ondrej and some confusion around
> > which Bash variable to use to fetch PIDs, but I may be mistaken.
>
> I haven't spent much time trying to debug that bash PID increment issue,
> but it perplexed me since it was the identical technique used in
> multiple tests in the audit-testsuite that has never caused an issue
> previously on any of the same set of test machines.  This was for the
> missing mount umount2 hang bug test
> https://github.com/linux-audit/audit-testsuite/pull/76

Ah, I think I see the problem.  Unless I'm mistaken, you are talking
about the shell/Bash command where the tests print the PID using "echo
$$" or similar, yes?  As you likely already know, in Bash (and likely
other shells as well), $$ is the PID of the running Bash process; in
the two places where I saw it used in the existing audit-testsuite $$
is being used to reference the Bash instance itself or something it
exec's (which ends up inheriting the PID).  It looks like you are
using $$ as a way to capture the PID of a child process being spawned
by the shell, yes?  This may explain why you sometimes get lucky and
$$+1 works for the PID.

> > This brings up to the next step: how do we want to address this?
> >
> > Prior to the queue rework that started in v4.10 things were a bit
> > simpler and it looks like we always registered a lost record
> > independent of the "audit=?" setting on the kernel command line and
> > the current queue backlog.  While this would have made this test
> > easier, it could result in some over counting problems in the cases
> > where an auditd instance came along and read the "lost" records from
> > the queue.  I don't think reverting to this behavior is ideal.
> >
> > I'm also not certain that recording lost records in the *not*
> > "audit=1" case is a good solution either.  In the case where the
> > system is not running an audit daemon they are almost guaranteed to
> > hit the backlog at some point and then their system will start spewing
> > scary looking audit log lost messages; we would surely win lots of
> > friends this way.
> >
> > We could move the "audit=1" check (really it's an audit_default check)
> > into audit_log_lost() and use it to squelch the printk() and then call
> > audit_log_lost in kauditd_hold_skb() if we can't queue the record.
> > This should preserve an accurate lost record count (well, until it
> > rolls over, but that's not a new concern), prevent unnecessary scary
> > lost record messages, and ensure a consistent audit_log_lost()
> > operation (all the other callers I didn't mention in this mail).
> >
> > Or the simplest option is to just ignore this and require that the
> > audit-testsuite be run on a system booted with "audit=1" :)
> >
> > I'm currently leaning towards moving the "audit=1" check into
> > audit_log_lost(), what do you guys think?
>
> I'll need some time to digest all this.  My first reaction is that
> requiring "audit=1" for the audit-testsuite is not the right answer and
> as we've discussed previously, it makes sense to run the entire
> testsuite both with and without "audit=1".

Yes, requiring "audit=1" is not an answer.  I put it there mostly as
an attempt at humor, hence the smiley at the end, but clearly my
abilities as a comic are lacking ;)

Anyway, no rush on looking it over; this isn't a -stable fix in my
eyes so it's not a candidate for merging until after the upcoming
merge window.  I think it will make a lot more sense once you go back
and look at the code, reading about the problem in email probably
makes is sound more complicated than it is.  I'm fairly certain the
right solution is to move the "audit=1" check into audit_log_lost(),
but I sent this out in case you or anyone else could think of a better
solution.  In the meantime I'll throw together a quick patch (it
should be pretty small) and add it to my testing queue.

-- 
paul moore
www.paul-moore.com