Kernel audit output is inconsistent, hard to parse

Tue Jan 29 22:56:36 UTC 2008

The format of audit messages from the kernel is a mess. The
bottom line is one cannot parse the audit messages without special
case knowledge of each audit message because the data formatting does
not follow any regular rules.

I don't know how it got this way, but it really needs to be fixed.

The primary offense is string formatting, specifically the use or
non-use of the functions audit_log_hex() audit_log_untrustedstring(),
and audit_log_n_untrustedstring(); depending on circumstances.

The net result is a field value might be one of the following cases:

1) a string without quotes (maybe a string, maybe an int, etc.)

2) a string enclosed in quotes (implies a string with no escaped chars)

3) a string which is represented as a sequence of hex values (not
    enclosed in quotes, but how do you distinguish this from case 1?)

Given the name=value formatting it is absolutely impossible to
correctly interpret the value component unless you know how
audit_log_format was invoked to generate the name=value pair. This
will be dependent on the kernel version, the field name, and the audit
record type. To be specific, during parsing only case 2 is
unambiguous. You cannot determine between case 1 and case
3. Heuristics based on each character being in the hexadecimal
character set fail for a significant subset of data, thus you don't
know if the value is a string encoded in hexadecimal which needs to be
decoded or a string which happens to be composed of hexadecimal
characters but is not encoded.

Thus we have the situation where to correctly parse the name=value
pair one must know the audit record type, the field name, and the
kernel version. That is just plain CRAZY and UNNECESSARY. Trying hide
this logic in auparse is just a band-aid over the problem compounded
by the fact auparse does not always get it right either. This is in
conjunction with the fact auparse has no way to know the kernel
version of the audit data it is attempting to parse (nor should it
even have tables based on kernel version).

The answer is to make the output parsable without special case
knowledge. It would appear many of these problems were introduced with
the functions audit_log_hex() audit_log_untrustedstring(), and
audit_log_n_untrustedstring() which attempt to correct for a double
quote, white space, or non-printable character in the output
string. However these are not used uniformly nor do they follow any
common approach for string representations in user land (why not?).

All field values without exception need to be enclosed in quotes to
delimit the value. Special characters inside the quotes need to
be escaped, following some standard convention. Please, lets not
invent a new encoding, this problem has already been solved
elsewhere many times before!

Also note the function audit_log_n_untrustedstring() in audit.c has a
bug and ignores the len parameter (it iterates till it finds a NULL
terminator even though it's supposed to stop after n chars).

Suggested Fix:
--------------

Most of these problems can easily be fixed if there is exactly one
central place to format an audit field value. The function
audit_log_vformat() could very easily ensure consistent formatting via
% format specifiers in the format string, e.g.:

     audit_log_format("n=%d path=%s", n, path);

Building audit output piecewise would be deprecated, e.g. these types
of sequences would be eliminated.

     audit_log_format(ab, " n=%d", n);
     audit_log_format(ab, " name=");
     audit_log_foo();

and replaced with:

     audit_log_format(ab, " n=%d name=%s", foo_to_string(foo));

Whenever audit_log_vformat() encounters a % format specifier it
formats to a string, then it converts the string to an escaped quoted
string, and then inserts the escaped quoted string into the buffer
(e.g. n="123" name="foo bar\n" )

This way the formatting is consistent, easy to apply, and is never
special cased by the caller.

There are no performance penalties of any note, calling a routine to
escape only needs to be done when the format specifier is %s. Currently
this is already done for a subset of output strings, so all we're doing
is removing the responsibility for escaping from the caller and doing
it consistently instead of in a subset of cases.

I don't really care what the encoding is. I only care that it is an
encoding with wide support. Backslash quoting is very popular,
familiar and has many implementations. The MIME quoted-printable
transfer encoding would be another option but might pose some problems
with line endings. I think backslash quoting would be a good choice. I
suspect everyone reading this message already knows exactly how to
interpret a string with backslash escapes.

Auparse is not the answer:
--------------------------

Auparse is not the answer to irregular kernel audit message
formatting. First of all it forces auparse to have special case logic
which is not 100% robust and is tied to the kernel source code
version.

Second, in it's current implementation auparse confuses transfer
decoding and substitution, two entirely different concepts needing to
be applied in entirely different circumstances, but which have been
conflated.

auparse_get_field_str() returns the field value in it's encoded form,
this is almost never of value to the caller. The caller wants the
field value to be unencoded so it can operate on it. If you want the
field value to be unencoded you have to call
auparse_interpret_field(). But auparse_interpret_field() performs two
distinctly different operations, it both decodes AND performs
contextual substitution. Contextual substitution only has meaning when
applied on the same host and at approximately the same time as when
the audit record was generated. Contextual substitution is mainly of
value for human readable output, it is difficult to utilize with
automated machine processing. At the moment it is not possible to get
a decoded value from auparse without it also performing undesired
substitution.

While we're at it:
------------------

If we do fix the format of audit messages we might as well fix some
other inconsistencies at the same time.

1) The initial part of AVC messages do not follow the standard
    name=value formatting used everywhere else in audit.

    a) It includes the string "avc:" which is redundant with the audit
    record type (e.g. type=AVC), the string "avc:" should be removed,
    it serves no purpose and only makes parsing much harder because of
    the inconsistency.

    b) denied|granted are bare words without a field name, it should be
    seresult="denied", once again to avoid special case parsing.

    c) The list of operations are enclosed in curly braces {} without a
    field name, this should be seperms=xxx, where xxx is a list. The
    use of curly braces to encode a list in audit data is unique. We
    should define how any audit message should encode a list of values
    and use that consistently for all audit data. While one could
    define a syntax such as "[value1, value2]" or some such, it might
    be informative to look at how other transfer mechanisms such as
    structured markup and ldap handle this case. They both utilize the
    concept of multi-valued attributes. Thus there is no list
    structure, but an attribute is allowed to repeat itself and in the
    process implicitly creates a list of values for the attribute. Thus
    {read write} might be represented as seperms="read"
    seperms="write". This regularity makes parsing much easier, it
    avoids special case syntax.

2) (Note, this is not a kernel issue) The host data is currently
    prepended to the audit record with the format host=xxx. Is this an
    encoded string or not? It should be encoded and it should be
    encoded in exactly the same format as the name/value pairs in the
    audit records. The same holds true for the record type, it should
    follow the same syntax as every other name/value pair.

3) The string "audit(ssssss.mmmm:iiii):" is a critical delimiter, it
    separates record properties (e.g. host, type, timestamp) from
    record data, which must be a sequence of name="value" pairs. But
    the time stamp should really follow the name/value pair encoding
    used elsewhere.

Desired syntax:
---------------

Records consist of a sequence of name="value" pairs.

Ordering of name/value pairs is significant for multi-valued
attributes (i.e. where name appears more than once), insignificant
otherwise.

The value MUST be enclosed in double quotes with interior characters
properly escaped.

White space between name, '=', and "value" is insignificant and
ignored.

The audit record is partitioned into two parts

   a) record properties (i.e. host, record type, timestamp)

   b) record data

The partition of properties and data occurs at a colon delimiter,
i.e. properties : data

The current formatting of the record timestamp
(e.g. audit(ssss.mmm:iii) is inconsistent with
all other name/value pairs. It should be "seconds="sss"
milliseconds="mmm" serial="iii", this allows parsing to be regular and
consistent.

Thus an audit record with consistent syntax would look like this,
where brackets [] indicate optional components:

[host=""] type="" seconds="" milliseconds="" serial="" : name="" [name=""]

What has to change and what's optional:
---------------------------------------

The formatting of name/value pairs in the kernel must be fixed, it is
simply impossible to correctly parse in it's current state.

The rest of the suggested changes are syntactic sugar which would make
parsing easier because of regular syntax, but they are not
critical. We could retain the existing formats if backwards
compatibility is felt to trump syntactic cleanliness and ease in
parsing. It's a judgment call over when and how to introduce change
and the anticipated impact.

-- 
John Dennis <jdennis at redhat.com>