[RFC PATCH] audit: improve audit queue handling when "audit=1" on cmdline

Wed Jan 19 16:10:43 UTC 2022

On Wed, Jan 19, 2022 at 7:23 AM cuigaosheng <cuigaosheng1 at huawei.com> wrote:
>
> Hi Paul,
>
> There are some questions about this patch and hope it helps.

Hello,

No problem, thanks for taking a look, responses below ...

> /**
>   * kauditd_retry_skb - Queue an audit record, attempt to send again to auditd
>   * @skb: audit record
> + * @error: error code (unused)
>   *
>   * Description:
>   * Not as serious as kauditd_hold_skb() as we still have a connected auditd,
>   * but for some reason we are having problems sending it audit records so
>   * queue the given record and attempt to resend.
>   */
> -static void kauditd_retry_skb(struct sk_buff *skb)
> +static void kauditd_retry_skb(struct sk_buff *skb, __always_unused int error)
>  {
> - /* NOTE: because records should only live in the retry queue for a
> - * short period of time, before either being sent or moved to the hold
> - * queue, we don't currently enforce a limit on this queue */
> - skb_queue_tail(&audit_retry_queue, skb);
> + if (!audit_backlog_limit ||
> +    skb_queue_len(&audit_retry_queue) < audit_backlog_limit) {
> + skb_queue_tail(&audit_retry_queue, skb);
> + return;
> + }
> +
> + audit_log_lost("kauditd retry queue overflow");
> + kfree_skb(skb);
>  }
>
> When we process the main queue, should we printk the skb when audit_log_lost be call ?

That's a good point, I'll add that, thank you.

> /**
>   * kauditd_hold_skb - Queue an audit record, waiting for auditd
>   * @skb: audit record
> + * @error: error code
>   *
>   * Description:
>   * Queue the audit record, waiting for an instance of auditd.  When this
> @@ -564,19 +566,31 @@ static void kauditd_rehold_skb(struct sk_buff *skb)
>   * and queue it, if we have room.  If we want to hold on to the record, but we
>   * don't have room, record a record lost message.
>   */
> -static void kauditd_hold_skb(struct sk_buff *skb)
> +static void kauditd_hold_skb(struct sk_buff *skb, int error)
>  {
>   /* at this point it is uncertain if we will ever send this to auditd so
>   * try to send the message via printk before we go any further */
>   kauditd_printk_skb(skb);
>
>   /* can we just silently drop the message? */
> - if (!audit_default) {
> - kfree_skb(skb);
> - return;
> + if (!audit_default)
> + goto drop;
> +
> + /* the hold queue is only for when the daemon goes away completely,
> + * not -EAGAIN failures; if we are in a -EAGAIN state requeue the
> + * record on the retry queue unless it's full, in which case drop it
> + */
> + if (error == -EAGAIN) {
> + if (!audit_backlog_limit ||
> +    skb_queue_len(&audit_retry_queue) < audit_backlog_limit) {
> + skb_queue_tail(&audit_retry_queue, skb);
> + return;
> + }
> + audit_log_lost("kauditd retry queue overflow");
> + goto drop;
>   }
>
> - /* if we have room, queue the message */
> + /* if we have room in the hold queue, queue the message */
>   if (!audit_backlog_limit ||
>      skb_queue_len(&audit_hold_queue) < audit_backlog_limit) {
>   skb_queue_tail(&audit_hold_queue, skb);
> @@ -585,24 +599,30 @@ static void kauditd_hold_skb(struct sk_buff *skb)
>
>   /* we have no other options - drop the message */
>   audit_log_lost("kauditd hold queue overflow");
> +drop:
>   kfree_skb(skb);
>  }
>
> If we move skbs from audit_hold_queue to audit_retry_queue, will these skbs be printed
> more than once by printk?

We should never end up moving a record from the hold queue to the
retry queue.  The kauditd_hold_skb() function normally moves the
record from the retry queue to the hold queue when the kernel/daemon
connection is reset, however, in the case of a transient -EAGAIN error
it will instead requeue the record on the retry queue.

That said, you are correct in that there is a scenario where a record
might be printed via printk() more than once; the system would need to
be booted with "audit=1" on the kernel command line, daemon would need
to remain in the stopped state for an extended period of time, and the
retry queue would need to be overflowing.  While it is unfortunate,
this is far from a common case and to resolve that would require
significant modifications[1].

[1] I've actually started on those changes, but as it is part of a
much larger effort it will be some time before it is ready.

-- 
paul moore
paul-moore.com