Auditd errors on busy hosts when rolling over log files

Tue Nov 5 13:59:51 UTC 2013

Hello,

On Tuesday, November 05, 2013 10:07:08 PM Burn Alting wrote:
> I did a little experimentation today.
> 
> On a system that generates around 7500 audit events every five minutes I
> changed, without success, the following:
> 
> In auditd.conf
> - changed num_logs from 9 to 5 although I didn't expect a change as I
> move out the rolled over (audit.log.?) log files as part of the
> processing so there shouldn't be a big file rename impost

This should have helped a little since you dropped 4 syscalls.

> - changed priority_boost from 4 to 8
> 
> In audit.rules
> - changed backlog from 32K to 64K to 96K to 128K

This should only help to the extent of your constant fill rate. What happens is 
your events are coming in and auditd is unable to attend to them during the 
rotation because it has to start with audit.log.9 and delete it, then move all 
logs up one number leaving no audit.log. At that point it can open a new one. 
So, the backlog needs to be big enough to handle the overflow during that brief 
time. 
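
To make that sequence concrete, here is a rough sketch of the renames in shell. It only mimics the order of operations described above and is not auditd's actual code; the function name rotate_sketch and its arguments are made up for illustration. Try it on a scratch directory, not on /var/log/audit.

```shell
# Illustration only: mimics the rename sequence described above.
# Usage: rotate_sketch DIR MAX_SUFFIX   (hypothetical helper, not part of audit)
rotate_sketch() {
    dir=$1
    max=$2
    # Delete the oldest log first.
    rm -f "$dir/audit.log.$max"
    # Move every remaining log up one number, oldest first.
    n=$max
    while [ "$n" -gt 1 ]; do
        prev=$((n - 1))
        if [ -e "$dir/audit.log.$prev" ]; then
            mv "$dir/audit.log.$prev" "$dir/audit.log.$n"
        fi
        n=$prev
    done
    # Move the live log aside. Until a new audit.log is opened,
    # incoming events can only wait in the kernel backlog.
    if [ -e "$dir/audit.log" ]; then
        mv "$dir/audit.log" "$dir/audit.log.1"
    fi
}
```

Each existing file costs one rename, which is why a smaller num_logs means fewer syscalls per rotation.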

I would expect rotation takes 10 milliseconds at the most. But just for the 
sake of argument, let's say it took 1 whole second. At your fill rate (7500 
events per 300 seconds), you should be receiving 25 events in that second. 
Some of these events may be compound, meaning they have supporting records 
besides syscall, such as PATH or CWD. Let's assume you have 4 supporting 
records per event. You now have about 100 extra records, or 125 in all, 
arriving during that one second. So setting the backlog to 32k should be more 
than sufficient...unless the system is about to fall over anyway.
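
The same back-of-the-envelope estimate can be written as shell arithmetic. The one-second rotation time and the 4 supporting records per event are the assumptions from the paragraph above, not measurements, and the variable names are only for illustration.

```shell
events_per_5min=7500     # observed event rate from the report
rotation_secs=1          # deliberately pessimistic assumption
support_per_event=4      # assumed PATH/CWD-style records per event
backlog_limit=32768      # the 32K setting

events_per_sec=$((events_per_5min / 300))
events_in_window=$((events_per_sec * rotation_secs))
support_records=$((events_in_window * support_per_event))
total_records=$((events_in_window + support_records))

echo "events per second:        $events_per_sec"
echo "supporting records:       $support_records"
echo "records during rotation:  $total_records"
echo "backlog limit:            $backlog_limit"
```

Even with the pessimistic numbers, the total is a small fraction of the backlog limit, which is why the fill rate during rotation alone does not seem to explain the errors.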

You might try running:

while true; do auditctl -s; sleep 5; done

and see if your system is never able to catch up. If that's the case, you need 
to do something about the audit daemon's priority or scheduling. You can boost 
the priority way up, say to 20. You might even add the 'chrt' command to the 
initscript to see if you can put auditd on a different scheduler.
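
A slightly more readable variant of that loop prints only the backlog figure with a timestamp. This is a sketch: it assumes `auditctl -s` reports the backlog either as "backlog=N" on a single status line or as "backlog N" on its own line, which may not match every audit version. The function names are made up, and the chrt priority in the trailing comment is an arbitrary example value.

```shell
# Extract the current backlog from `auditctl -s` output read on stdin.
# Assumes either "backlog=N" within one line or "backlog N" on its own line.
backlog_from_status() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "backlog" && i < NF) { print $(i + 1); exit }
            if ($i ~ /^backlog=/) { sub(/^backlog=/, "", $i); print $i; exit }
        }
    }'
}

# Print a timestamped backlog sample every INTERVAL seconds (default 5).
watch_backlog() {
    interval=${1:-5}
    while true; do
        printf '%s backlog=%s\n' "$(date +%T)" "$(auditctl -s | backlog_from_status)"
        sleep "$interval"
    done
}

# Run as root:
#   watch_backlog 5
# If the backlog never drains, one thing to try (example values only):
#   chrt -f -p 10 "$(pidof auditd)"    # SCHED_FIFO, priority 10
```

If the number keeps climbing between samples, auditd is not keeping up even outside of rotation.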

> - changed rules to reduce the recorded events per 5 minute interval from
> 7500 to 500-600 for the same period.

That should help both the backlog before rotation as well as the fill rate 
during rotation.

> This particular system is running audit-1.8.2-el5 but I see a similar
> problem on a RHEL 6.4 box which I believe is running audit-2.2-2.el6.

I think there was one change to normal processing that saved a syscall: rather 
than stat the disk, auditd just does arithmetic instead. I don't know if that 
one patch would help or not. It would allow auditd to keep the backlog lower 
prior to rotation.

> I did note that if I executed the sync(1) command before signaling
> auditd to roll over (ie execute /bin/kill -s USR1 pid) the error
> SOMETIMES did not appear.
> 
> So I am a little bit lost.

You might also experiment with the disk flushing in auditd.conf.
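
The settings in question are the flush and freq keywords. As a starting point, a small helper can show what a host is currently using. The function name is made up, and /etc/audit/auditd.conf is assumed to be the config location, so adjust the path if yours differs.

```shell
# Print the flush and freq lines from an auditd.conf-style file.
# Usage: show_flush_settings FILE   (hypothetical helper)
show_flush_settings() {
    grep -E -i '^[[:space:]]*(flush|freq)[[:space:]]*=' "$1"
}

# Example:
#   show_flush_settings /etc/audit/auditd.conf
# Settings one might then experiment with in that file (example values only):
#   flush = INCREMENTAL
#   freq = 50
```

The freq value only matters when flush is set to incremental; a larger freq means fewer explicit flushes to disk, at the cost of more unflushed records if the machine goes down.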

> I believe that the actual effect is just
> - the cost of two additional lines in /var/log/messages
> - the loss of a few logs
> 
> My actual process is to
> a. roll over the log file
> b. run an ausearch --interpret like command

Running the command shouldn't interfere.

> Perhaps my alternative is to modify my ausearch-like command to be
> stateful and have it process only new events as per a patch I made to
> ausearch some time back
> 
>         Subject: 	[PATCH] ausearch: Add checkpoint capability and have
>         incomplete logs carry forward when processing multiple audit.log
>         files
>         Date: 	05/11/2013 03:59:34 PM
> 
> 
> Am open to any suggestions ... I think the key issue is that I reduced
> the generated commands into audit.log from 7500 to 600 per five minute
> interval but I still see the error.

I think it's several things. Dropping the fill rate will help. But something 
else is going on. Maybe some of these hints can help you investigate the 
problem.

-Steve

> > On Monday, November 04, 2013 07:46:18 PM Burn Alting wrote:
> > > Hi,
> > > 
> > > I have some quite busy hosts that emit the following errors when I
> > > request that the audit log file be rolled over (via a kill -s USR1
> > > auditdpid).
> > > 
> > >   Error receiving audit netlink packet(No buffer space available)
> > >   Error sending signal_info request (No buffer space available)
> > > >
> > > >From reading earlier posts (circa 2009) it would appear my options are
> > > 
> > > a. Increase backlog buffer (currently 32768)
> > > b. Increase priority_boost (currently 4)
> > > c. Reduce the number of log files (currently 9)
> > 
> > Another corollary to this is that you can increase the file size and
> > decrease the total files which would help on rotation.
> > 
> > > Does anyone have a feel for which of the above should offer the best
> > > return?
> > 
> > There are 2 more options:
> > 
> > 1) Review the rules to make sure you are not getting events that you
> > really do not need. If you have a lot of false positives, then you might
> > add some arguments that better narrow the results. For example, perhaps
> > you have this rule:
> > 
> > -a always,exit -F arch=b64 -S clock_settime -k time-change
> > 
> > This can give a lot of false positives. The one that really matters is
> > when a program sets CLOCK_REALTIME (the wall clock). So, the rule can be
> > re-written as:
> > 
> > -a always,exit -F arch=b64 -S clock_settime -F a0=0 -k time-change
> > 
> > which narrows its scope.
> > 
> > 2) You might experiment with cgroups.
> > 
> > > Are there other configuration parameters I could adjust (aside from
> > > changing my ruleset in audit.rules)?
> > 
> > There might be general disk tuning parameters in sysctl that could help as
> > well. Choice of file system also has performance impacts. I haven't done
> > any experimenting on the performance side, but I know there are people
> > here that also have very busy systems.
> > 
> > -Steve