Audit for live supervision

Kay Hayen kayhayen at gmx.de
Fri Aug 15 06:43:49 UTC 2008


Hello Steve,

thanks for your reply, very helpful. :-)

> > Now one issue, I see is that the times that we get from auditd through
> > the socket from its child daemon may not match the start_date exactly.
>
> All time hacks in the audit logs come from the kernel at the instant the
> record is created. They all start by calling audit_log_start, and right
> here is where time is written:
>
> http://lxr.linux.no/linux+v2.6.26.2/kernel/audit.c#L1194
>
> The source that is used is current_kernel_time();

[...]

> The audit system has millisecond resolution. This was considered adequate
> due to system ticks being < 1000 Hz. The current_kernel_time() is a broken
> down time struct similar to pselect's. This is how it's used:
>
>  audit_log_format(ab, "audit(%lu.%03lu:%u): ",
> 			t.tv_sec, t.tv_nsec/1000000, serial);

From what I checked, it seems that current_kernel_time() is indeed fed exactly
by the jiffy/system ticks since boot (at least I found comments that suggest
so). I still have to verify how it is translated. There is still the issue of
translating ticks into seconds since 1970. So far I have only managed to get
hold of the system boot time at a granularity of one second.

I have no clue if that is the same time the kernel uses to offset the ticks
value. I will do a delta test once I can. But my suspicion is that the real
time clock has better resolution than one second, which is all I get from the
btime field in /proc/stat.
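
For reference, this is roughly how I get at it right now (a minimal sketch; it
just scans /proc/stat for the btime field, which only yields whole seconds):

    # Minimal sketch: read the system boot time (seconds since the epoch)
    # from the btime field of /proc/stat. Only whole-second granularity.
    def read_boot_time():
        for line in open( "/proc/stat" ):
            if line.startswith( "btime " ):
                return int( line.split()[1] )
        raise RuntimeError( "no btime field in /proc/stat" )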

More importantly, and somewhat blocking my tests: with the improved rules I
get the following quite reproducibly when compiling:

type=SYSCALL msg=audit(1218773075.500:118620): arch=c000003e syscall=59
success=yes exit=0 a0=7fff6f78cf90 a1=7fff6f78cf40 a2=7fff6f78f068 a3=0
items=2 ppid=11412 pid=11421 auid=4294967295 uid=1000 gid=1000 euid=1000
suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts3 ses=4294967295
comm="gcc-4.3" exe="/usr/bin/gcc-4.3" key=(null)

[...]
type=SYSCALL msg=audit(1218773075.496:118624): arch=c000003e syscall=56
success=yes exit=11421 a0=1200011 a1=0 a2=0 a3=7fc067776770 items=0
ppid=11407 pid=11412 auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000
fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts3 ses=4294967295
comm="gnatchop" exe="/usr/bin/gnatchop" key=(null)

Please note the _ascending_ sequence number but _descending_ time. This is
pasted from my audit.log and quite surprising. It triggers an assertion in
our code, because we also seem to receive things in the wrong order from the
socket. There was no FORK before the EXECVE that refers to the pid. We
tolerate that (obviously there are going to be processes whose fork we didn't
see because they started before we did; we don't do an initial scan yet), but
we don't tolerate a fork returning an already existing pid.
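
To illustrate the kind of check that fires (a minimal sketch, not our actual
code: it parses the audit(seconds.millis:serial) stamp and asserts that time
does not go backwards while the serial goes up):

    import re

    # Sketch of the ordering assertion: extract the audit(secs.millis:serial)
    # stamp from each record and require that an ascending serial never comes
    # with a descending timestamp.
    STAMP_RE = re.compile( r"msg=audit\((\d+)\.(\d+):(\d+)\)" )

    def check_ordering( records ):
        last = None
        for record in records:
            match = STAMP_RE.search( record )
            if match is None:
                continue
            secs, millis, serial = [ int( x ) for x in match.groups() ]
            if last is not None and serial > last[2]:
                assert ( secs, millis ) >= ( last[0], last[1] ), \
                    "serial ascended but time descended: %r -> %r" % \
                    ( last, ( secs, millis, serial ) )
            last = ( secs, millis, serial )

Fed the two SYSCALL records above, this assertion trips.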

Seems like a bug? Can you have a look at it? These are the rules I was using:

-a entry,always -F arch=b32 -S clone -S fork -S vfork
-a entry,always -F arch=b64 -S clone -S fork -S vfork
-a entry,always -S execve
-a entry,always -S exit_group -S exit

I didn't apply the arch filter to exit, etc. yet, but there hasn't been a pid
rollover so far, so I don't think that missing an exit is really the issue
here.

Plus, I still didn't fully grasp why that arch filter was necessary in the
first place. I mean, after all, I was simply expecting that by default no
filter would mean all arches. Is that filter actually a selector? Does it
have to do with the fact that syscall numbers are arch dependent?

> > The other thing is sequence numbers. We see in the output sequence
> > numbers for each audit event. Very nice. But can you confirm where these
> > sequence numbers are created? Are they done in the kernel, in auditd or
> > in its child daemon?
>
> They are done in the kernel and are incremented for each audit_log_start so
> that no 2 audit events within the same millisecond have the same serial
> number. Their source is here:
>
> http://lxr.linux.no/linux+v2.6.26.2/kernel/audit.c#L1085

Thanks for the pointer. It indeed looks like nothing can go wrong on the
reliability side of time and serial, which makes my report above all the more
astounding.

> > The underlying question is, how safe can we be that we didn't miss
> > anything when sequence numbers don't suggest so. We would like to use the
> > lossless mode of auditd. Does that simply mean that auditd may get behind
> > in worst case?
>
> Yes. You would want to do a couple things. Increase the kernel backlog,
> increase auditd priority, & increase audispd's internal queue.

That's fine with us. We just wouldn't want to be in an inconsistent state
after a load peak. I think with a dedicated core for our supervision
processes, nothing except benchmarks of clone performance (a.k.a. fork bombs)
should be able to trigger any actual issue.
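
For the record, these are the knobs I understand you to mean (a sketch only;
the file locations match audit 1.7.x as far as I can tell, and the values are
just guesses on my part):

    # audit.rules: raise the kernel backlog (auditctl -b)
    -b 8192

    # auditd.conf: raise auditd's scheduling priority
    priority_boost = 4

    # audispd.conf: deepen audispd's internal queue
    q_depth = 1024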

> > Can you confirm that a type=EOE delimits every event (is that even
> > the correct term to use, audit trace, how is it called).
>
> It delimits every multipart event. you can use something like this to
> determine if you have an event:
>
> if ( r->type == AUDIT_EOE || r->type < AUDIT_FIRST_EVENT ||
>      r->type >= AUDIT_FIRST_ANOM_MSG) {
>   have full event...
> }

I will have to check if this affects our intended process tracing. The
parsing is certainly not simplified by it, for a possibly unrelated reason.

We read from a socket in data chunks, which are then processed. For that, we
want to identify the end of a message before processing it. Right now we have
switched to waiting for type=EOE and its newline. The problem without that is
that a non-EOE record may be a multi-line thing: I think the arguments are
put on separate lines, and that makes it hard to detect the end of a complete
type= record. I don't see that in the audit.log, so it is probably an issue
of the sub-daemon:

 type=EXECVE msg=audit(1218774568.173:129440): argc=4 a0="/bin/sh"
a1="..."
a2="..."
a3="..."

Without a very stateful message parser, one that e.g. knows how many lines
are to follow an EXECVE, we don't know when to forward it to the part that
should process it.

What we do first, once we got a message, is the following code:

        # 1. Some records are split across multiple lines. The good thing is
        #    that the continuation lines never start with whitespace, so we
        #    can join them back into single lines. This makes the next part
        #    easier.

        lines = []

        for line in message.split( "\n" ):
            if line.strip() == "":
                pass
            elif line.startswith( " type=" ):
                # A new record starts with " type=".
                lines.append( line )
            else:
                # Continuation line: glue it onto the previous record.
                assert line[0] != ' '

                lines[-1] = lines[-1] + ' ' + line


This is in the hope that continuation lines indeed always start with a
non-space and type lines always start with a space. Would you consider this
format worth changing, and possible to change?

I have no idea how much it represents an existing external interface, but I
can imagine you can't change it (easily). Probably the end of a type= record
has to be detected by a terminating empty line for those that can be
continued. But it would be very ugly to have to know which event types can do
this so early in the decoding process.
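
To make concrete what I mean by a stateful parser, here is a minimal sketch
of the grouping we could live with, assuming the joined lines from above and
that an event ends either with its type=EOE record or when a record with a
new serial number shows up:

    import re

    SERIAL_RE = re.compile( r"msg=audit\(\d+\.\d+:(\d+)\)" )

    # Sketch: group already-joined record lines into complete events. An
    # event is flushed when its type=EOE record arrives, or when a record
    # carrying a different serial number appears.
    def group_events( lines, forward ):
        pending = []
        pending_serial = None
        for line in lines:
            match = SERIAL_RE.search( line )
            if match is None:
                continue
            serial = int( match.group( 1 ) )
            if pending and serial != pending_serial:
                forward( pending )
                pending = []
            pending_serial = serial
            if line.lstrip().startswith( "type=EOE" ):
                if pending:
                    forward( pending )
                pending = []
            else:
                pending.append( line )
        return pending   # possibly incomplete event, kept for the next chunk

The downside is that a single-record event is only forwarded once the next
event begins, which is exactly why knowing the closing record types this
early would help.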

> > BTW: Release-wise, will RHEL 5.3 include the latest auditd?
>
> That is the plan. But there will be a point where audit development
> continues and bugfixes are backported rather than new version. At a
> minimum, audit-1.7.5 will be in RHEL5.3. Maybe 1.7.6 if we have another
> quick release.

That's OK for us. And until then self-compiled will do.

> If you are on amd64, I would suggest:
>
> -a entry,always -F arch=b32 -S clone -S fork -S vfork
> -a entry,always -F arch=b64 -S clone -S fork -S vfork

Actually that solved the trouble of not seeing anything on "Debian". The fact 
is that we are using RHEL x86 for production and Debian (or Ubuntu) amd64 on 
our development machines. So we never checked RHEL amd64, which likely would 
show the same thing.

But see above, is this possibly a bug?

> There might be tunables that different distros can use with glibc. strace
> is your friend...and having both 32/64 bit rules if amd64 is the target
> platform.

We did that of course. And what was confusing us was that the audit.log did 
actually seem to show the calls. Can that even be?

> > Does audit not (yet?) use other tracing interfaces like SystemTap, etc.,
> > where people try to have 0 cost for inactive traces?
>
> They have a cost. :)  Also, systemtap, while good for some things, is not
> good for auditing. For one, systemtap recompiles the kernel to make new
> modules. You may not want that in your environment. It also has not been
> tested for CAPP/LSPP compliance.
>
> > Also on a general basis. Do you recommend using the sub-daemon for the
> > job or should we rather use libaudit for the task instead? Any insight is
> > welcome here.
>
> It really depends on what your environment allows. Do you need an audit
> trail? With search tools? And reporting tools? Do you need the system to
> halt if auditing problems occur? Do you need any certifications?

I see. Luckily we are not into security, but only "safety". I can't find
anything on Wikipedia about it, so I will try to explain it briefly; please
forgive my limited understanding of it. :-)

Increasingly, we in ATC need to provide a "Software Assurance Level" (SWAL)
to our customers, and they in turn to their regulators. That's one of many
standards for defining "safety". Depending on the level, you may need to not
only document the requirements of your software, but also relate each line of
code to them and ultimately prove the correctness of the compiler output. I
believe they do this for fighter planes.

We currently target level 3, which means that we will have to prove that we
have concerned ourselves with all possible hazards and their combinations,
and whether they would lead to anything "safety" relevant. For the running
system, we don't have to prove anything. There are "legal recordings" of the
input and output data, but no complete logs of the operation are required per
se, although of course, as we support the analysis of trouble reports, they
are very welcome. That level is acceptable for non-airborne systems.

So if we are running and monitoring a system, and if e.g. a certain process
must run to prevent planes from colliding, we must provide explanations (and
tests) of why we can be sure at any time _if_ (not "that") the process is
running, or how the problem will be noticed by the software, and ultimately
by the operators through other means, if that fails.

On a general level, redundancy is normally the solution to hazards.
Everything is doubled, and you normally have 2 trackers, etc., so you don't
use just one piece of software, but multiple implementations.

When something fails, we generally stop the software unless we are sure it's 
an isolated event. That is because data corruption is worse than no data at 
all. An unseen flight or a falsely reported flight is a lot worse than a 
crashed system that only needs a restart and will be back in operation within 
time limits.

Our concern is not at all to prevent or detect malicious use of the system. If 
somebody wants to run a binary in a hidden way and takes effort to achieve 
that, other processes have failed already. Our systems always run in an 
environment where security is already solved through other means. The systems 
e.g. do allow rsh login, without a password, as root. I hope you can continue 
breathing now. :-)

It certainly will be very helpful to have the audit log, and to have it
searchable, and I understand we get that automatically by leaving audit
enabled but configured correctly. In the past we have disabled it, because it
caused a full disk and a boot failure on RHEL 3 after only a month or so. I
think it complained about the UDP echo packets that we use to check our
internal LAN operations, but it could have been SELinux too.

So, what we will do with audit is look at the source code, identify design
principles that make errors impossible, write that down, define our
requirements for it, and test those. And ultimately we may have to provide an
alternative implementation without audit at all, with probably worse latency,
and the ability to detect and compensate for (simulated) audit bugs. And test
that as well.

And in some figure we will say: if the tracker process exits, we will detect
that on average after 1 ms, and at the latest after 50 ms. That will play a
role in the time it takes to switch to the standby tracker process on other
hardware. And that will e.g. have the requirement to be able to take over
within 1 or 3 seconds, which will be possible because the process that hands
over started early enough and itself takes a limited time.

The ultimate benefits of auditd for us would be:

a) Very low latency: being able to observe an "exit" as it happens, not just
when a periodic timer runs a test. So far we can only achieve this kind of
performance via SIGCHLD for child processes, and signals are not an ideal
interface. Being able to monitor non-children is very good for us; it is
actually what made us interested in audit in the first place.

b) Robust operation. In theory, the audit approach should let us obtain
assurances that certain hazards simply cannot happen; that's very nice and
certainly increases the assurances we can give for our process supervision, a
cornerstone of every system.

c) Our middleware may suddenly offer to monitor processes like ntpd, cron 
jobs, etc. very easily and without system changes. There are systems that 
work with "ps", "grep" and "kill" to achieve this, but as you can imagine, 
that only goes so far.

Security, in the sense of intrusion detection, authorization, etc. doesn't 
play a role to us. So we don't need audit trails for our live supervision, 
we "only want to know" what was running, what is running.

> > 2. We don't want to poll periodically, but rather only wake up (and then
> > with minimal latency) when something interesting happened. We would want
> > to poll a periodic check that forks are still reported, so we would
> > detect a loss of service from audit.
>
> You might write a audispd plugin for this.

Did you mean for the periodic check, or for the whole job, i.e. our
supervision process? We certainly would prefer the plugin approach for such a
test, especially if there is hope that you would accept it into your code.

The closing of the socket in case of loss of service would be a sufficient
signal to us.
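
To sketch the periodic check I have in mind (only the idea;
wait_for_audit_event is a hypothetical helper standing in for whatever reads
our socket): fork a short-lived child at a known interval and require that a
matching record arrives within a deadline, otherwise treat the audit feed as
lost.

    import os
    import time

    DEADLINE = 0.050   # 50 ms, matching the figure mentioned above

    # Sketch of the liveness self-test: fork a no-op child and demand that
    # the audit stream reports it within DEADLINE seconds. The helper
    # wait_for_audit_event() is hypothetical; in our case it would watch
    # the socket we already read from for a record mentioning the pid.
    def audit_feed_alive( wait_for_audit_event ):
        pid = os.fork()
        if pid == 0:
            os._exit( 0 )              # the child does nothing at all
        seen = wait_for_audit_event( pid, DEADLINE )
        os.waitpid( pid, 0 )           # reap the child
        return seen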

Regarding performance, I would like to say you are likely right that it's a
non-issue. It has something of a bike-shed feel to me, though. :-) I think I
still have http://lwn.net/Articles/290428/ on my mind, where I got the
impression that kernel markers would only require a few noop instructions as
placeholders for jumps that would cause audit code to run. I was wondering
why audit doesn't use that. Is that historic (it didn't exist, nobody made a
patch for it) or a conscious decision (too difficult, not worth it)? Just
curious here, and of course the comment could be read as a bit scary, because
it actually means we will have to benchmark the impact...

Best regards,
Kay Hayen



