Re: [PATCH][RFC] audit: set wait time to zero when audit failed

Wed Sep 18 01:07:28 UTC 2019

> -----Original Message-----
> From: Paul Moore [mailto:paul at paul-moore.com]
> Sent: September 18, 2019 3:17
> To: Li,Rongqing <lirongqing at baidu.com>
> Cc: Eric Paris <eparis at redhat.com>; linux-audit at redhat.com
> Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed
> 
> On Mon, Sep 16, 2019 at 9:08 PM Li,Rongqing <lirongqing at baidu.com> wrote:
> > > -----Original Message-----
> > > From: Paul Moore [mailto:paul at paul-moore.com]
> > > Sent: September 17, 2019 6:52
> > > To: Li,Rongqing <lirongqing at baidu.com>
> > > Cc: Eric Paris <eparis at redhat.com>; linux-audit at redhat.com
> > > Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed
> > >
> > > On Sun, Sep 15, 2019 at 10:55 PM Li,Rongqing <lirongqing at baidu.com> wrote:
> > > > > > If audit_log_start() fails because the queue is full, kauditd
> > > > > > waits for the receiving queue to empty; but with no receiver, a
> > > > > > task is forced to wait 60 seconds for each audited syscall,
> > > > > > and it hangs for a very long time.
> > > > > >
> > > > > > So in this condition, set the wait time to zero to reduce the
> > > > > > wait, and restore the wait time when audit works again.
> > > > > >
> > > > > > it partially restore the commit 3197542482df ("audit: rework
> > > > > > audit_log_start()")
> > > > > >
> > > > > > Signed-off-by: Li RongQing <lirongqing at baidu.com>
> > > > > > Signed-off-by: Liang ZhiCheng <liangzhicheng at baidu.com>
> > > > > > ---
> > > > > > Reboot takes a very long time on my machine (CentOS 6u4 +
> > > > > > kernel 5.3) because TIF_SYSCALL_AUDIT is set by default, and on
> > > > > > reboot the userspace process that receives audit messages is
> > > > > > killed, so nothing drains the audit queue.
> > > > > >
> > > > > > git bisect shows it is caused by 3197542482df ("audit: rework
> > > > > > audit_log_start()")
> > > > > >
> > > > > >  kernel/audit.c | 9 +++++++--
> > > > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > >
> > > > > This is typically solved by increasing the backlog using the
> > > > > "audit_backlog_limit" kernel parameter (link to the docs below).
> > > >
> > > > That should avoid my issue, but the default behavior does not
> > > > work for me. Also, not everyone knows enough about audit; they may
> > > > spend a lot of effort finding the root cause and estimating how
> > > > large "audit_backlog_limit" should be.
> > >
> > > The pause/sleep behavior is desired behavior and is intended to help
> > > kauditd/auditd process the audit backlog on a busy system.  If we
> > > didn't sleep the current process and give kauditd/auditd a chance to
> > > flush the backlog when it was full, a lot of bad things could happen
> > > with respect to audit.  We generally select the backlog limit so
> > > that this is not a problem for most systems, although there will
> > > always be edge cases where the default does not work well; it is
> > > impossible to pick defaults that work well for every case.
> > >
> >
> > I just want it to behave as it did before 3197542482df ("audit: rework
> > audit_log_start()"): wait 60 seconds once if
> > auditd/readahead-collector has a problem draining the audit backlog.
> 
> The patch you mention fixed what was deemed to be buggy behavior; as
> mentioned previously in this thread I see no good reason to go back to the old
> behavior.
> 
> > > If you are not using audit, you can always disable it via the kernel
> > > command line, or at runtime (look at what Fedora does).
> > >
> > > > > You might also want to investigate what is generating so many
> > > > > audit records prior to starting the audit daemon.
> > > >
> > > > It is /sbin/readahead-collector; in fact, we stop auditd. We
> > > > are doing a reboot test, which reboots the machine repeatedly to
> > > > test hardware/software.
> > > >
> > > > It is the same as the following:
> > > > auditctl -a always,exit -S all -F pid='xxx'
> > > > kill -s 19 `pidof auditd`
> > > >
> > > > then the audited task will be hung
> > >
> > > So you are seeing this problem only when you run a test, or did you
> > > provide this as a reproducer?
> >
> > auditctl -a always,exit -S all -F ppid=`pidof sshd`
> > kill -s 19 `pidof auditd`
> > ssh root@127.0.0.1
> >
> > then ssh will be hung forever
> 
> That is expected behavior.  You are putting a massive audit load on the system
> by telling the kernel to audit every syscall that sshd makes, then you are
> intentionally killing the audit daemon and attempting to ssh into the system.
> The proper fix(es) here would be to 1) set reasonable audit rules and/or 2) use
> an init system that monitors and restarts auditd when it fails (systemd has this
> capability, I believe some others do as well).
> 

Neither works.
auditd is not dead; it is in the stopped state (kill -s 19), so systemd/init will not restart it.
Even with only a few audit rules, after multiple accesses the backlog will fill up because there is no receiver.

Either way, I think the original behavior may be better:

commit ac4cec443a80bfde829516e7a7db10f7325aa528
Author: David Woodhouse <dwmw2 at shinybook.infradead.org>
Date:   Sat Jul 2 14:08:48 2005 +0100

    AUDIT: Stop waiting for backlog after audit_panic() happens

    We force a rate-limit on auditable events by making them wait for space
    on the backlog queue. However, if auditd really is AWOL then this could
    potentially bring the entire system to a halt, depending on the audit
    rules in effect.

    Firstly, make sure the wait time is honoured correctly -- it's the
    maximum time the process should wait, rather than the time to wait
    _each_ time round the loop. We were getting re-woken _each_ time a
    packet was dequeued, and the timeout was being restarted each time.

    Secondly, reset the wait time after audit_panic() is called. In general
    this will be reset to zero, to allow progress to be made. If the system
    is configured to _actually_ panic on audit_panic() then that will
    already have happened; otherwise we know that audit records are being
    lost anyway.

    These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because
    those aren't particularly well-designed. It probably should have been
    done by sysctls or sysfs anyway -- one for a later patch.
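
To make the two points of that commit message concrete, here is a small userspace model, not kernel code. Every name in it (struct backlog_model, model_sleep, model_log_start, wake_interval) is hypothetical and chosen only for illustration; the fake clock and the assumption that nothing drains the queue stand in for a stopped auditd.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the behavior described in the commit
 * message above. It is NOT the kernel implementation. */
struct backlog_model {
    long wait_time;     /* maximum ticks one caller may wait for space */
    long wait_overflow; /* value wait_time is reset to after giving up */
    int  queued;        /* records currently sitting in the backlog */
    int  limit;         /* backlog limit */
    long now;           /* fake clock, in ticks */
    long lost;          /* records dropped so far */
};

/* Model a sleep: the clock advances and, because the receiver is assumed
 * to be stopped, nothing is removed from the queue. */
static void model_sleep(struct backlog_model *m, long ticks)
{
    m->now += ticks;
}

/* Try to queue one record. wake_interval models the caller being woken
 * early, every wake_interval ticks. Returns true if the record was queued,
 * false if it was dropped. */
static bool model_log_start(struct backlog_model *m, long wake_interval)
{
    long start = m->now; /* deadline is measured from here, once */

    while (m->queued >= m->limit) {
        long deadline = start + m->wait_time;

        if (m->wait_time > 0 && m->now < deadline) {
            /* First point: sleep only for the time that remains, so
             * early wakeups do not restart the timeout. */
            long remaining = deadline - m->now;
            long step = (wake_interval > 0 && wake_interval < remaining)
                            ? wake_interval : remaining;
            model_sleep(m, step);
            continue;
        }

        /* Second point: the record is lost anyway, so stop making later
         * callers wait by resetting the wait time (zero here). */
        m->lost++;
        m->wait_time = m->wait_overflow;
        return false;
    }

    m->queued++;
    return true;
}
```

With wait_time = 60, wait_overflow = 0 and a full queue, the first call in this model waits 60 ticks in total regardless of wake_interval and then drops the record; later calls drop immediately instead of waiting another 60 ticks each. The model does not include restoring the wait time once the receiver drains the queue again.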

Thanks

-RongQing
> --
> paul moore
> www.paul-moore.com